Abstract



Research into statistical parsing for English has enjoyed over a decade of 
successful results. However, adapting these models to other languages has met 
with difficulties. Previous comparative work has shown that Modern Arabic is one 
of the most difficult languages to parse due to rich morphology and free word 
order. Classical Arabic is the ancient form of Arabic, and is understudied in 
computational linguistics, relative to its worldwide reach as the language of the 
Quran. The thesis is based on seven publications that make significant 
contributions to knowledge relating to annotating and parsing Classical Arabic. 

Classical Arabic has been studied in depth by grammarians for over a thousand 
years using a traditional grammar known as i’rab (إعراب). Using this grammar to
develop a representation for parsing is challenging, as it describes syntax using a 
hybrid of phrase-structure and dependency relations. This work aims to advance 
the state-of-the-art for hybrid parsing by introducing a formal representation for 
annotation and a resource for machine learning. The main contributions are the 
first treebank for Classical Arabic and the first statistical dependency-based parser 
in any language for ellipsis, dropped pronouns and hybrid representations. 

A central argument of this thesis is that using a hybrid representation closely 
aligned to traditional grammar leads to improved parsing for Arabic. To test this 
hypothesis, two approaches are compared. As a reference, a pure dependency 
parser is adapted using graph transformations, resulting in an 87.47% F1-score.
This is compared to an integrated parsing model with an F1-score of 89.03%,
demonstrating that joint dependency-constituency parsing is better suited to 
Classical Arabic. 

The Quran was chosen for annotation as a large body of work exists providing 
detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in 
combination with expert supervision. A practical result of the annotation effort is 
the corpus website: http://corpus.quran.com, an educational resource with over
two million users per year. 

‘Glory be to thee! We have no knowledge except what you have taught us.
Indeed it is you who is the all-knowing, the all-wise.’




A prayer of the angels 
-The Quran, verse (2:32) 

Part I: Introduction and Background




The worthwhile problems are the ones you can really 
solve or help solve, the ones you can really contribute 
something to... No problem is too small or too trivial if 
we can really do something about it. 

- Richard Feynman 



1 Introduction 

1.1 Motivation

The topic of this thesis is statistical parsing for Classical Arabic using machine 
learning. This work includes constructing a formal grammatical representation 
and developing the Quranic Arabic Corpus as a dataset to test parsing algorithms. 

Parsing is the process of determining the syntactic structure of a sentence. 
Algorithms for parsing are researched in computational linguistics, an 
interdisciplinary field that combines computer science, statistical modelling and 
mathematical logic to process natural language. Analyzing the syntactic structure 
of a sentence through parsing can be a prerequisite step for deeper processing 
tasks such as machine translation (Huang et al., 2006; Zollmann and Venugopal, 
2006), semantic analysis (Carreras and Marquez, 2005) and task execution, in 
which machines execute physical tasks using natural language commands 
(Kuhlmann et al., 2004). 

My own motivation for developing a parser for Classical Arabic is that it is a 
less-studied language in computational linguistics. Classical Arabic is a 1,600 
year-old ancient language that is the direct ancestor of Modern Standard Arabic 
(MSA) spoken today. Although a variety of parsers exist for Modern Arabic, 
almost no previous work has been done for statistical parsing of Classical Arabic, 
the original language of the Quran. 



Figure 1.1 shows an example verse (ayah) from the Quran, written in Classical 
Arabic from right to left using a connected cursive script. Arabic, Hebrew,
Turkish and Finnish are examples of languages that are morphologically
rich and highly inflected. The complexity of these morphologically rich languages
poses special challenges to parsing work.



Figure 1.1: Verse (6:76) from the Quran.

(6:76) When the night covered him, he saw a star. He said, ‘This is my
Lord.’ But when it set, he said, ‘I do not love those that disappear.’



The grammatical system explored in this thesis is i’rab (إعراب), a 1,000 year-old
comprehensive linguistic theory that describes Classical Arabic’s phonology (the 
interaction of the units of sound that make up speech), morphology (the study of 
the substructure of words), syntax (the structure of sentences) and discourse 
analysis (the study of the discourse structures used in communication). This 
linguistic theory developed independently of Western thought and has influenced 
modern theories of syntax (Versteegh, 1997b; Baalbaki, 2008). For example, 
along with Panini’s Ashtadhyayi for Classical Sanskrit, i’rab is considered to be
one of the origins of modern dependency grammar (Kruijff, 2006; Owens, 1988). 

My motivation for this thesis originated in a personal interest in the linguistic 
structure of the Quran. Classical Arabic grammar is widely studied in the Islamic 
world due to the importance of the Quran, and several grammatical works exist 
that provide detailed analysis of its syntax (Salih, 2007; Darwish, 1996). I have 
often wondered if this analysis could be derived through statistical models using 
machine learning. Could algorithms learn from example data and reproduce the 
historical analyses of traditional grammarians? My interest in this idea led me to 
research statistical methods for parsing Classical Arabic, inspired by Arabic’s 
long and rich grammatical tradition. 



1.2 Research Questions 

1.2.1 Is Statistical Parsing Viable for Classical Arabic? 

Over the last two decades, statistical parsers have been used as an alternative to, 
and in combination with, previous rule-based parsers (Marcus et al., 1993; Abney, 
1996). In contrast to rule-based parsers, statistical parsers learn a grammatical 
model from a treebank - a syntactically annotated corpus of example sentences. A 
variety of methods are used for statistical parsing, ranging from maximum entropy 
techniques for phrase-structure representations (Charniak, 2000) to support vector
machines (SVMs) for dependency grammar (Nivre et al., 2007b). 

Most research into statistical parsing has focused on English, with the best 
models achieving up to 92% accuracy (McClosky, Charniak and Johnson, 2006). 
Adapting these parsing models to other languages has been less successful. For 
example, adapting Bikel’s parser to Chinese has resulted in an F1-score of 79.9%
(Chiang and Bikel, 2002). Similarly, results from the CoNLL shared task on
multilingual dependency parsing show that Modern Arabic is one of the most 
challenging languages to parse (Nivre et al., 2007a). This is in part due to 
Arabic’s complex morphology. As noted by Soudi et al. (2007): 



The morphology of Arabic poses special challenges to computational 
natural language processing systems. The exceptional degree of ambiguity 
in the writing system, the rich morphology, and the highly complex word 
formation process of roots and patterns all contribute to making 
computational approaches to Arabic very challenging. 



It is thus not immediately obvious whether parsing Classical Arabic is tractable
using purely statistical methods. The primary research question of this thesis is
whether statistical parsing for Classical Arabic is a viable approach for achieving
state-of-the-art parsing accuracy.

1.2.2 Is a Hybrid Representation Suitable for Parsing?

In modern linguistics, there is no universally accepted grammatical theory for 
representing syntactic information. Examples of different theories include 
transformational grammar (Chomsky, 1970), dependency grammar (Mel’cuk, 
1988), functional grammar (Halliday and Matthiessen, 2006) and combinatory 
categorial grammar (Steedman, 2000). For annotation, multiple representations 
can be used. The two main representations used by treebanks are constituency 
phrase-structure (using relations between clauses and their constituents), and
dependency grammar (using dependency relations between words). This thesis 
describes a novel hybrid representation, combining aspects of both dependency 
and constituency syntax. The motivation for using a hybrid approach for Classical 
Arabic is to remain closely aligned to traditional analyses of Quranic grammar. 

This section introduces the hybrid representation by comparing it to two existing
representations. The following two diagrams annotate the same English sentence. 
Figure 1.2 is a constituency tree, with preterminal nodes annotated using an 
example POS (part-of-speech) tagset (PRON = pronoun, MOD = modal, NEG = 
negative particle, V = verb, PUNC = punctuation). Non-terminals are phrase tags 
(NP = noun phrase, VP = verb phrase, ADVP = adverb phrase, S = sentence).



Figure 1.2: Phrase-structure parse tree using a simple grammar.

In contrast to the constituency approach, dependency theory represents sentence
structure using binary dependencies between pairs of words. In Figure 1.3, the 
example sentence has been annotated using the same part-of-speech tags as Figure 
1.2, but using an alternative dependency tagset for syntax (subj = subject, obj = 
object, mod = modal, neg = negation). Unless otherwise stated, dependency 
diagrams in this thesis follow the convention of dependent nodes pointing to head 
nodes, the same convention used to annotate Classical Arabic in the Quranic 
Arabic Corpus. 1 

Although these two diagrams annotate an English sentence, they illustrate a task 
that is more challenging in Arabic - morphological segmentation. In the diagrams, 
terminal nodes are not words but segments of words. For example, the word 
‘you’ll’ has been segmented into the pronoun ‘you’ and the modal ‘will’. In 
English, only a minority of words such as contractions require segmentation for 
treebank construction. This contrasts with Arabic, where morphological analysis 
is complex, as many words require segmentation into multiple morphemes that 
each have different syntactic roles in sentence structure. 



[Diagram segments and tags: You (PRON), ’ll (MOD), never (NEG), find (V), it (PRON), . (PUNC)]

Figure 1.3: Pure dependency graph for an English sentence.
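The pure dependency representation can be illustrated with a small data-structure sketch. The edges below are an assumption, since the diagram itself is not reproduced here: each non-punctuation segment is attached to the verb 'find' using the relations from the tagset above, with dependents pointing to heads. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    form: str  # surface form of the morphological segment
    pos: str   # part-of-speech tag


@dataclass(frozen=True)
class Dependency:
    dependent: int  # index of the dependent segment
    head: int       # index of the head segment
    relation: str   # dependency tag (subj, obj, mod, neg)


# Terminal nodes are segments, so "You'll" contributes two entries.
segments = [
    Segment("You", "PRON"),
    Segment("'ll", "MOD"),
    Segment("never", "NEG"),
    Segment("find", "V"),
    Segment("it", "PRON"),
    Segment(".", "PUNC"),
]

# Assumed edges: every non-punctuation segment depends on the verb.
dependencies = [
    Dependency(0, 3, "subj"),
    Dependency(1, 3, "mod"),
    Dependency(2, 3, "neg"),
    Dependency(4, 3, "obj"),
]


def heads(deps):
    """Map each dependent to its head, rejecting segments with two heads."""
    head_of = {}
    for dep in deps:
        if dep.dependent in head_of:
            raise ValueError(f"segment {dep.dependent} has more than one head")
        head_of[dep.dependent] = dep.head
    return head_of


def describe(segs, deps):
    """Render each edge as 'dependent --relation--> head'."""
    return [
        f"{segs[d.dependent].form} --{d.relation}--> {segs[d.head].form}"
        for d in deps
    ]


for line in describe(segments, dependencies):
    print(line)
```

In this sketch the verb is the only segment without a head, which makes it the root of the graph.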



1 Appendix A describes the graph layout algorithm used to produce syntax diagrams in this
thesis and for the online Quranic Treebank (http://corpus.quran.com/treebank.jsp).

The previous diagrams illustrated two different representations for syntactic
annotation. For parsing, the choice of representation used to model a language is 
fundamental to the operation of a parser. It constrains possible parsing algorithms 
and has a direct effect on parsing accuracy. This is highlighted by the recent use 
of model adaptation, where existing statistical parsers designed for English have 
been retrained for Modern Arabic (Green and Manning, 2010). Because Arabic 
contains linguistic constructions not found in English, this has resulted in parsing 
underperformance (described further in section 2.4). 



قال هذا ربي

He said, ‘This is my Lord.’



Figure 1.4: Extract from verse (6:76). 



In this thesis, Classical Arabic syntax will be described using an alternative 
representation based on Arabic’s grammatical tradition. However, despite its 
prominence in Arabic linguistic works, the grammatical rules of i’rab have 
previously lacked a formal representation, making computational modelling of 
Classical Arabic grammar challenging. In contrast to formal methods, traditional 
analysis is described by grammarians through prose. For example, the syntax of 
verse (6:76) shown in Figure 1.4 is described by Salih (2007) using the following 
analysis (translated from Arabic): 



In this verse, ‘said’ is a perfective verb, whose subject is a dropped pronoun 
of the form ‘he’. The noun ‘lord’ is in the nominative case and is the 
predicate of the demonstrative pronoun ‘this’. The suffixed pronoun ‘my’ 
attached to the noun is a possessive clitic. The nominal sentence, headed by 
the demonstrative pronoun, is governed by the verb ‘said’ as a direct object. 

[Diagram words, read right to left: (6:76:7) qala ‘(He) said,’ (V, with a dropped PRON subject); (6:76:8) hadha ‘This’ (DEM); (6:76:9) rabbi ‘(is) my Lord.’ (N, PRON)]

Figure 1.5: Hybrid dependency-constituency graph.



A hybrid representation can be used to formalize this analysis. For example, 
Salih analyses the phrase ‘This is my Lord’ as a dependency of the verb ‘said’. 
Although i’rab describes dependencies between morphological segments, this 
shows that the grammar also describes dependencies between words and phrases. 
Arabic grammatical theory could be interpreted as either a pure dependency or 
constituency representation, but a hybrid representation more closely aligns to 
traditional analysis. Figure 1.5 annotates verse (6:76) of the Quran using the 
hybrid formalism that will be presented in Chapter 6. The diagram shows a graph 
with nodes that are either morphological segments with part-of-speech tags (V =
Verb, PRON = Pronoun, DEM = Demonstrative, N = Noun) or phrase nodes (NS 
= Nominal Sentence). Edges are tagged with dependency relations such as object, 
subject and predicate, shown in Arabic using traditional terminology. 
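As a rough illustration of how such a graph might be stored, the following sketch encodes the analysis by Salih (2007) quoted earlier. It is not the formalism of Chapter 6: the class names are hypothetical, relation labels are given in English rather than in the traditional Arabic terminology, and the split of rabbi into a noun and a suffixed pronoun is written as 'rabb' and '-i' purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str         # POS tag for segments, phrase tag for phrase nodes
    text: str          # transliterated form or covered text
    kind: str          # "segment", "elided" or "phrase"
    span: tuple = ()   # for phrase nodes: (first, last) covered node index


@dataclass
class HybridGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (dependent, head, relation)

    def add_node(self, label, text, kind="segment", span=()):
        self.nodes.append(Node(label, text, kind, span))
        return len(self.nodes) - 1

    def add_edge(self, dependent, head, relation):
        self.edges.append((dependent, head, relation))

    def dependents_of(self, head):
        """List (text, relation) for every node that depends on `head`."""
        return [(self.nodes[d].text, rel)
                for d, h, rel in self.edges if h == head]


g = HybridGraph()
said = g.add_node("V", "qala")
he = g.add_node("PRON", "(he)", kind="elided")   # dropped subject pronoun
this = g.add_node("DEM", "hadha")
lord = g.add_node("N", "rabb")
my = g.add_node("PRON", "-i")                    # possessive clitic
# Phrase node: the nominal sentence covering 'This is my Lord'.
ns = g.add_node("NS", "hadha rabbi", kind="phrase", span=(this, my))

g.add_edge(he, said, "subject")
g.add_edge(lord, this, "predicate")
g.add_edge(my, lord, "possessive")
g.add_edge(ns, said, "object")   # a phrase, not a word, is the dependent

print(g.dependents_of(said))
```

The last edge is what makes the graph hybrid: its dependent is a phrase node rather than a morphological segment, and one node stands for a pronoun that has no surface form.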

The second research question addressed in this thesis is whether a hybrid
dependency-constituency representation is better suited to parsing Classical
Arabic than a pure dependency representation. This question will be
answered by annotating the Quran using the hybrid representation and comparing
the two approaches to parsing.

1.2.3 Can Crowdsourcing be used for Annotating Arabic?

Of potential wider interest beyond Classical Arabic parsing is the use of 
crowdsourcing to construct the annotated treebank which will be used to train a 
statistical parser. Statistical parsers require high-quality training data in the form 
of sentences annotated according to a chosen syntactic representation. A typical 
annotation methodology involves paid experts who perform offline annotation. 
However, online collaboration has recently emerged as a viable
alternative to more conventional approaches for developing tagged corpora
(Chamberlain et al., 2009). Online collaboration has been used for a wide variety 
of linguistic tagging tasks ranging from named-entity resolution of international 
hotels (Su et al., 2007) to syntactic annotation of Latin and Ancient Greek texts 
(Bamman et al., 2009). 

In this thesis, crowdsourcing will be used to develop the first treebank for 
Classical Arabic. Following initial automatic tagging, the main task that volunteer 
annotators are asked to perform is to proofread morphological and syntactic 
annotation. Annotators verify this against gold standard analyses from Arabic 
reference works of Quranic grammar. Although the reference material contains 
equivalent grammatical information, because its content is unstructured prose that 
is not easily machine readable, a manual cross-checking stage is required. 

The third research question investigated in this thesis is whether a
form of crowdsourcing can be used as an annotation methodology for producing
high-quality tagging of Classical Arabic. Volunteer crowdsourcing can be cost
effective, but consistency and accuracy need to be ensured if the data is to be used 
for statistical modelling. In the Quranic Arabic Corpus, expert annotators are 
promoted to a supervisory role, reviewing and discussing the work of others 
online using an interactive message board forum. In this thesis, the collaborative 
annotation methodology will be compared to the alternative of crowdsourcing 
without expert supervision, and evaluated for accuracy. 

1.3 Original Contributions of the Thesis

1.3.1 Theoretical Contributions 

The main theoretical contributions that will be presented in the thesis are: 

• The first formalism of i’rab and the first morphosyntactic annotation
scheme for Classical Arabic. This includes a novel hybrid
dependency-constituency representation, with a fine-grained tagset for
parts-of-speech and phrases, morphological features and dependency relations.

• The first evaluation of a methodology for online supervised collaboration 
for Arabic annotation. This methodology combines crowdsourcing with 
expert supervision to produce high-quality annotation for Arabic text.

1.3.2 Practical Contributions 

The main practical contributions to be presented are: 

• The first treebank for Classical Arabic. This includes manually-verified
morphological annotation for 77.4K words tagged with 783K feature-values,
together with syntactic tagging for 37.6K words. Supplementary
annotation includes named-entity tagging, an ontology of concepts, a 
word-by-word English translation and a morphological lexicon. 

• The first web-based platform for capturing, editing and visualizing Arabic 
morphosyntactic annotations online. This includes a comprehensive set of 
supplementary linguistic tools to access and search corpus annotations. 

• The first statistical parser for Classical Arabic. In addition, this is also the 
first dependency-based statistical parser in any language that handles 
elliptical structures, dropped pronouns and a hybrid representation. 

1.4 Thesis Outline

This thesis is divided into five parts with 12 chapters, shown in Figure 1.6 below: 



Part I: Introduction and Background
   1 Introduction
   2 Literature Review
   3 Historical Background

Part II: Modelling Classical Arabic
   4 Orthographic Representation
   5 Morphological Representation
   6 Syntactic Representation

Part III: Developing the Quranic Arabic Corpus
   7 Annotation Methodology
   8 Annotation Platform

Part IV: Statistical Parsing
   9 Hybrid Parsing Algorithms
   10 Machine Learning Experiments

Part V: Further Work and Conclusion
   11 Uses of the Quranic Arabic Corpus
   12 Contributions and Future Work



Figure 1.6: Organization of thesis chapters. 



Part I provides relevant background information. Following this introductory
chapter, Chapter 2 contains the literature review, discussing Arabic treebanks and
annotation methodologies. Recent morphological analyzers and statistical parsers 
for Arabic are also compared. Relevant historical background on the Arabic 
linguistic tradition is provided in Chapter 3. 

Part II presents a formal model of Classical Arabic, with a representation for
orthography (Chapter 4), morphology (Chapter 5) and syntax (Chapter 6). The 
representation is presented both as a well-defined set-theoretic description and as 
an annotation scheme. 

Part III describes the development of the Quranic Arabic Corpus. Chapter 7 
discusses the annotation methodology of supervised collaboration. Chapter 8 
describes the web-based software platform used to capture annotations online and 
the supplementary linguistic tools developed for annotators. 

Part IV focuses on statistical parsing. In Chapter 9, two algorithms for hybrid 
parsing are compared: a multi-step process using graph transformations and a 
novel one-step algorithm without post-processing. Chapter 10 evaluates the parser 
using statistical models induced from the treebank by machine learning. A series 
of experiments consider the effect of using different morphological features for 
parsing and the results are compared to recent parsing work for Modern Arabic. 

Part V concludes the thesis. Chapter 11 describes recent research that has made 
use of the annotations in the Quranic Arabic Corpus and Chapter 12 summarizes 
the main contributions and presents recommendations for future research. The last 
chapter concludes with a discussion of the challenges and limitations of the work 
as well as its implications for theoretical and computational linguistics. 

Perhaps the central problem we face in all of computer
science is how we are to get to the situation where we 
build on top of the work of others... Science is supposed 
to be cumulative. 

- Richard Hamming 



2 Literature Review 

2.1 Introduction 

Arabic is a major world language. Together with Chinese, English, French, 
Russian and Spanish, it is one of the six official languages of the United Nations. 
Including its literary form and its various dialects, it is the first language for 280 
million native speakers across the Middle East and North Africa (Prochazka, 
2006). Classical Arabic is the liturgical language of prayer and worship for the 
world’s Muslim population, estimated at between 1.57 billion (Lugo, 2009) and 
1.65 billion people (Kettani, 2010), up to a quarter of the world’s population. 

Arabic has recently become the focus of an increasing number of natural 
language processing projects (Habash, 2010). This review describes relevant work 
in four areas: morphology, syntax, parsing and annotation methodologies. The 
first part of the review describes recent work for Arabic morphology, including an 
analysis of the limitations of previous morphological work for the Quran. To 
provide context for the syntactic representation developed for the Quranic Arabic 
Corpus, the review compares the Penn, Prague and Columbia Arabic treebanks, 
focusing on the approaches used to formalize Arabic syntax. 

Following the description of morphological and syntactic projects, parsing work 
for Arabic is reviewed, describing how different syntactic representations affect 
accuracy. Attention is also given to dual dependency-constituency parsing work 
for German and Swedish, as these methods are relevant to the hybrid parsing work 
described in Chapter 9. Models for ellipsis are also reviewed, which are often
ignored in parsing work but are developed in this thesis. The review of parsing 
work concludes with a discussion of recent work for Hebrew. This related Semitic 
language presents similar challenges to statistical parsing, and illustrates recent 
trends in parsing that are also applicable to Arabic. 

Methodologies for other relevant annotation projects beyond Arabic are also 
reviewed, comparing offline expert annotation to collaborative online annotation 
and crowdsourcing. Finally, the conclusion summarizes the implications of the 
reviewed work in relation to the thesis research questions. 

2.2 Arabic Morphological Analysis 

This section of the review discusses different approaches to Arabic computational 
morphology. Morphological analysis tasks for Arabic include segmentation (the 
division of compound word-forms into prefixes, stems and suffixes),
part-of-speech tagging (assigning a tag to each morphological segment), lemmatization
(assigning lemmas to stems) and the identification of the roots and patterns used 
in inflected Arabic word-forms. 



2.2.1 The Buckwalter Arabic Morphological Analyzer 

The Buckwalter Arabic Morphological Analyzer (BAMA) is a freely available 
rule-based morphological analyzer, developed to perform initial tagging of the Penn
Arabic Treebank (Buckwalter, 2002). This previous work is relevant because an
analyzer based on BAMA’s algorithm will be used in Chapter 7 to perform initial 
morphological tagging for the Quranic Arabic Corpus. 

BAMA’s analysis algorithm depends on its lexicon. Version 2.0 of the analyzer 
contains 78,839 lexical entries representing 40,219 lemmas. This data is organized 
into segment tables with entries for prefixes, stems and suffixes, and compatibility 
tables listing permitted combinations of segments. The part-of-speech tagset used 
in these dictionary files is the same as that used for the Penn Arabic Treebank. 

The morphological analyzer processes undiacritized Arabic text, returning
several possible analyses for each word. Its analysis algorithm generates all 
possible segmentations into prefixes, stems and suffixes. For each combination, 
the segment tables are checked to determine if the analysis is linguistically 
plausible. The resulting filtered analyses are output with full diacritization and 
morphological annotation, augmented by features from the lexicon. 
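The generate-and-filter procedure described above can be sketched as follows. This is a toy illustration and not BAMA's implementation: the table contents, the category names and the use of three pairwise compatibility sets are all assumptions made for the example, and the diacritized output that the real analyzer produces is omitted. Words are written in Buckwalter transliteration.

```python
# Toy segment tables: each entry maps a segment to a hypothetical category.
PREFIXES = {"": "NONE", "w": "CONJ", "ll": "PREP_DET"}
STEMS = {"ktb": [("NOUN", "books"), ("VERB", "he wrote")]}
SUFFIXES = {"": "NONE", "wA": "VERB_PL"}

# Toy compatibility tables listing permitted category combinations.
PREFIX_STEM = {("NONE", "NOUN"), ("NONE", "VERB"), ("CONJ", "NOUN"),
               ("CONJ", "VERB"), ("PREP_DET", "NOUN")}
STEM_SUFFIX = {("NOUN", "NONE"), ("VERB", "NONE"), ("VERB", "VERB_PL")}
PREFIX_SUFFIX = {("NONE", "NONE"), ("NONE", "VERB_PL"), ("CONJ", "NONE"),
                 ("CONJ", "VERB_PL"), ("PREP_DET", "NONE")}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(min(max_prefix, n - 1) + 1):
        for j in range(i + 1, n + 1):
            if n - j <= max_suffix:
                yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return all splits whose segments are listed and mutually compatible."""
    analyses = []
    for prefix, stem, suffix in segmentations(word):
        if prefix not in PREFIXES or suffix not in SUFFIXES:
            continue
        p_cat, s_cat = PREFIXES[prefix], SUFFIXES[suffix]
        for stem_cat, gloss in STEMS.get(stem, []):
            if ((p_cat, stem_cat) in PREFIX_STEM
                    and (stem_cat, s_cat) in STEM_SUFFIX
                    and (p_cat, s_cat) in PREFIX_SUFFIX):
                analyses.append((prefix, stem, suffix, stem_cat, gloss))
    return analyses


print(analyze("llktb"))  # one analysis: the verb reading is filtered out
print(analyze("ktb"))    # two analyses: undiacritized input is ambiguous
```

The second call shows why a lookup of this kind returns several analyses per word: without diacritics, the tables alone cannot choose between the noun and verb readings.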

BAMA is widely used by the Arabic computational research community for a 
variety of tasks including diacritic restoration (Ananthakrishnan et al., 2005), 
automatic speech recognition and machine translation (Soltau et al., 2007) and 
named entity recognition (Farber et al., 2008). Its lexicon has also been used as 
one source of data for the Arabic version of Google’s online translation service. 
However, BAMA is limited by producing multiple analyses for each word. To 
overcome this limitation, BAMA’s lexicon has been used as the basis for more 
sophisticated statistical disambiguation systems, described in the next section. 



2.2.2 Lexeme and Feature Representations 

Habash (2007a) notes that Arabic morphological resources use different, often 
incompatible, representations to model morphology. Electronic dictionaries and 
lexicons are based around headwords and lemmas. Stemmers focus on extracting 
the stems of word-forms and deeper analyzers extract roots and patterns. Habash 
proposes a lexeme-plus-feature representation to relate these different resources. 
This work is relevant to Classical Arabic because the Quranic Arabic Corpus uses 
a similar representation for morphological annotation, as described in Chapter 5. 

For morphologically-rich languages such as Arabic, the term lexeme is used to 
denote an abstract grouping of words that share the same base meaning, but differ 
through inflection. A lemma, also known as a citation form, is a conventional 
choice of one word that represents a lexeme. Dictionary entries are usually 
organized by lemma. For example, in English the set of words ‘eat’, ‘eats’, ‘ate’ 
and ‘eating’ form a lexeme, with ‘eat’ as the lemma. 

Feature            Value              Definition
-----------------  -----------------  --------------------
Part of Speech     POS:N              Noun
                   POS:PN             Proper Noun
                   POS:V              Verb
                   POS:AJ             Adjective
                   POS:AV             Adverb
                   POS:PRO            Pronoun
                   POS:P and others   Preposition
Conjunction        w+                 ‘and’
                   f+                 ‘and’, ‘so’
Preposition        b+                 ‘by’, ‘with’
                   k+                 ‘like’
                   l+                 ‘for’, ‘to’
Verbal Particle    s+                 ‘will’
                   l+                 ‘so as to’
Definite Article   Al+                ‘the’
Verb Aspect        PV                 Perfective
                   IV                 Imperfective
                   CV                 Imperative
Voice              PASS               Passive
Gender             FEM                Feminine
                   MASC               Masculine
Subject            S:PerGenNum        Person = {1, 2, 3}
Object             O:PerGenNum        Gender = {M, F}
Possessive         P:PerGenNum        Number = {S, D, P}
Mood               MOOD:I             Indicative
                   MOOD:S             Subjunctive
                   MOOD:J             Jussive
Number             SG                 Singular
                   DU                 Dual
                   PL                 Plural
Case               NOM                Nominative
                   ACC                Accusative
                   GEN                Genitive
Definiteness       INDEF              Indefinite
Possession         POSS               Possessed

Table 2.1: Features used in ALMORGEANA’s morphological representation.

The ALMORGEANA system described by Habash (2007a) uses lexemes and
features to provide bidirectional morphological analysis and generation, suitable
for a variety of processing tasks, such as machine translation. The system utilizes
a lexicon based on dictionary data from BAMA, but applies a different algorithm
to perform morphological processing. In ALMORGEANA, the BAMA segment
tables are converted to the lexeme-plus-feature representation. Table 2.1 lists
the converted morphological features. Figure 2.1 below illustrates how these
features are used to represent the morphology of the compound Arabic word-form
lilkutubi (translated as ‘for the books’).



[kitAb_1 POS:N PL Al+ l+]

‘for the books’



Figure 2.1: Lexeme-plus-feature representation for an Arabic word. 



The lexeme for this surface form is represented by the lemma kitab, displayed
using Buckwalter transliteration as kitAb_1. The suffix _1 is part of a numbering
scheme used to distinguish word senses with the same name. Four features follow
the lemma. POS:N is the part-of-speech tag for nouns, and PL denotes a plural
word. Al+ indicates that the word-form has the Arabic al- prefix to denote
definiteness (‘the’), and l+ indicates the lam prefixed preposition (‘for’).
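To make the notation concrete, the sketch below reads a bracketed lexeme-plus-feature string into a small record. It is an illustration only and not part of ALMORGEANA: the class and function names are hypothetical, and the rule that tokens ending in '+' are proclitics while tokens containing ':' are name-value features is an assumption drawn from the single example above.

```python
from dataclasses import dataclass


@dataclass
class LexemeAnalysis:
    lemma: str       # lemma in Buckwalter transliteration
    sense: int       # sense number taken from the _N suffix
    features: dict   # e.g. {"POS": "N", "PL": True}
    clitics: list    # prefixed clitics such as "Al+" and "l+"


def parse_analysis(text):
    """Parse a string such as '[kitAb_1 POS:N PL Al+ l+]'."""
    tokens = text.strip().strip("[]").split()
    lexeme, rest = tokens[0], tokens[1:]
    lemma, _, sense = lexeme.rpartition("_")
    features, clitics = {}, []
    for token in rest:
        if token.endswith("+"):
            clitics.append(token)          # assumed: trailing '+' marks a proclitic
        elif ":" in token:
            name, value = token.split(":", 1)
            features[name] = value         # name:value feature such as POS:N
        else:
            features[token] = True         # bare flag such as PL
    return LexemeAnalysis(lemma, int(sense), features, clitics)


analysis = parse_analysis("[kitAb_1 POS:N PL Al+ l+]")
print(analysis)
```

Keeping the lemma, its features and its clitics in separate fields reflects the point of the representation: the same record can be read for analysis or filled in for generation.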

Like the Buckwalter analyzer, ALMORGEANA outputs several possible 
morphological analyses for each input Arabic word. Habash and Rambow (2005) 
extend the system to select a statistically most-probable analysis. Using data from 
the Penn Arabic Treebank converted to the lexeme-plus-feature representation, 
they build a statistical model to rank possible analyses using support vector 
machines trained to recognize individual morphological features. Testing against
the Penn Treebank, they report high accuracy scores of 99.3% for morphological
segmentation at word-level, and 98.1% for part-of-speech tagging over all tokens, 
using a reduced tagset. 

Based on this work, Habash, Rambow and Roth (2009b) describe a toolkit 
consisting of two Arabic morphological systems, MADA and TOKAN. Like 
ALMORGEANA, the toolkit utilizes the BAMA lexicon. MADA (Morphological 
Analysis and Disambiguation for Arabic) is a statistical morphological analyzer 
that selects the best possible BAMA analysis using weighted predicted features. 
TOKAN is a flexible Arabic tokenizer that provides morphological segmentation 
of Arabic words according to a number of possible tokenization schemes. The 
toolkit has been used for a variety of further work including English-to-Arabic 
machine translation (Badr et al., 2008) and named entity recognition (Farber et al., 
2008; Benajiba et al., 2008). 

Compared to the Buckwalter Analyzer, this toolkit is attractive because it
produces a single morphological analysis for each Arabic word. The use of a
lexeme-plus-feature representation is notable for providing a computational model
of Arabic morphology that is flexible enough to support different processing 
tasks. This representation will be extended to Classical Arabic morphology in 
Chapter 5. 



2.2.3 Fine-Grained Morphological Analysis 

In contrast to previous work, the SALMA tagger (Standard Arabic Language 
Morphological Analysis) uses a more fine-grained morphological tagset based on 
concepts from the Arabic linguistic tradition (Sawalha and Atwell, 2010; Sawalha, 
Atwell and Abushariah, 2013). This work compares to the annotation presented in 
this thesis, which is also fine-grained. 

The SALMA tagger utilizes a lexicon of inflected surface forms containing 2.7 
million vowelized word-root pairs, built by combining 23 Arabic dictionaries. 
Arabic text is annotated using a set of 22 morphological features that include 
part-of-speech, gender, number, person, case, mood, definiteness, voice, emphasis, 
transitivity, variability, roots and verb structure. The tagging algorithm segments 
words by applying a sequence of regular expressions to produce a list of candidate 
analyses. Segmented stems are matched to the lexicon to extract possible roots. A 
pattern database consisting of 2,730 patterns for verbs and 985 for nouns is used 
to search for appropriate root-pattern pairs. Morphological features are then 
annotated using the lexicon. 

Sawalha et al. (2013) measure the tagger’s accuracy by manually annotating a 
gold-standard dataset of 2,000 words using samples from two corpora. For 
Classical Arabic, they annotate the morphological analysis of the Quran by Dror 
et al. (2004), described in the next section. For Modern Arabic they use data from 
the Corpus of Contemporary Arabic (Al-Sulaiti and Atwell, 2006). For a set of 15 
morphological features, they report an estimated accuracy score of 98.53% for 
tagging Modern Arabic and 90.1% for Classical Arabic. 

This work demonstrates that automatic fine-grained morphological analysis of 
Arabic is possible. The morphological representation in Chapter 5 will also use a 
fine-grained tagset based on traditional grammar. It differs by using an alternative 
set of tags with morphological features developed specifically for Classical Arabic 
and designed to integrate with a syntactic representation. 



2.2.4 Finite State Morphological Analysis of the Quran 

This section describes the use of Finite State Machines (FSMs) to annotate the 
Arabic morphology of the Quran (Dror et al., 2004). To the best of the author’s 
knowledge, this work is the only other wide-coverage computational analysis of 
Classical Arabic morphology, before the new work presented in this thesis. 
However, unlike the Quranic Arabic Corpus, the FSM analysis has not been 
manually verified by expert annotators. Dror et al. provide several different 
possible analyses for each word in the Quran, but do not disambiguate these to 
bring their annotations up to gold-standard level. 

Their approach is based on finite state computing. FSMs are abstract 
mathematical models of computation that consist of multiple states, together with 
rules that determine transitions between states. They have been applied to a wide 
variety of morphologically-rich languages, for which lexicons and morphological 
rules are developed manually by linguistic experts and encoded as state transitions 
(Roche and Schabes, 1997; Beesley and Karttunen, 2002). The output of FSM 
systems is typically in a lexeme-plus-feature representation. In the description of 
their system for Classical Arabic, Dror et al. note that the language of the Quran 
remains relatively unexplored in contrast to Modern Arabic: 



Except for isolated efforts, little has been done with computer-assisted 
analysis of the text. Thus, for the present, computer-assisted analysis of the 
Quran remains an intriguing but unexplored field. 



Their FSM analysis utilizes a new morphological lexicon based on the Quranic 
concordance by Abdalbaqi (1987). The lexicon associates lexemes with roots and 
patterns, and consists of 2,500 noun-forms, 100,000 possible verb bases and 
several hundred closed-class words. The verb bases were generated automatically 
by applying a list of Arabic word patterns to the roots in the Quran. As a result, 
most of the verb bases in the lexicon do not occur in the text. To perform 
morphological analysis, an FSM consisting of approximately 300 hand-written 
rules for verbs and 50 rules for nouns is used to generate a list of possible 
analyses for each word in the Quran. In their evaluation, Dror et al. note that they 
do not perform full morphological disambiguation to select a single analysis for 
each word. However, by performing manual verification on a 1,250 word sample 
of the Quran, they estimate that 86% of words have a correct morphological 
analysis in the list of possible outputs produced by their analyzer. 
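
The 86% figure is a coverage estimate rather than a disambiguation accuracy: a word counts as correct if the right analysis appears anywhere in its list of candidates. As a minimal sketch of that calculation (the function name, the analysis strings and the three-word sample are hypothetical and are not taken from Dror et al.), it could be computed as follows:

```python
from typing import List


def oracle_coverage(candidates: List[List[str]], gold: List[str]) -> float:
    """Fraction of words whose gold analysis appears anywhere in the
    analyzer's candidate list for that word (no disambiguation)."""
    if not gold or len(candidates) != len(gold):
        raise ValueError("candidates and gold must be non-empty and aligned")
    hits = sum(1 for options, answer in zip(candidates, gold) if answer in options)
    return hits / len(gold)


# Hypothetical sample: one candidate list per word, in text order.
# The analysis strings are invented labels, not the analyzer's real output.
candidates = [
    ["N+NOM", "V+PERF"],   # word 1: two possible analyses
    ["V+IMPF"],            # word 2: one analysis, which is wrong
    ["PREP", "N+GEN"],     # word 3: two possible analyses
]
gold = ["V+PERF", "V+PERF", "PREP"]

print(f"{oracle_coverage(candidates, gold):.2%}")
```

Because the measure ignores how many incorrect candidates accompany the correct one, it gives an upper bound on the accuracy that any later disambiguation step could reach on the same sample.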

This work is notable for being the first automatic morphological analysis of the 
Quranic text. However, their analysis has three limitations. Firstly, without manual 
correction, the annotations cannot be considered to be of gold-standard quality. Secondly, 
the Classical Arabic script of the Quran is not used, which makes it difficult to 
relate their work to other Arabic computational resources. Instead a phonetic 
transcription into the English alphabet is used as their orthographic representation. 
Thirdly, they do not publish a well-defined annotation scheme. Although they 
provide example output for their analyzer, they do not fully describe their tagset 
or list their set of morphological features. However, this could be inferred by 
processing their annotations to build up a list of possible tags. These limitations 
will be addressed in this thesis by providing manually-verified annotation using a 
well-defined morphosyntactic representation. To address the limitations with their 
approach to orthography, a new orthographic representation for Classical Arabic 
script that is convertible to Unicode will be presented in Chapter 4. 

2.3 Arabic Syntactic Treebanks 

Over the last several decades, the development and use of annotated corpora has 
grown to become a major focus of research for both linguistics and computational 
natural language processing. Corpora provide the empirical evidence that is used 
to advance various theories of language (Sampson and McCarthy, 2005). They are 
also used by computational linguists to engineer state-of-the-art natural language 
systems and resources such as electronic lexicons (Hajic et al., 2003; Kucera and 
Francis, 1967) and part-of-speech taggers (Brants, 2000a; Spoustova et al., 2009; 
Søgaard, 2011). Treebanks are annotated corpora that include morphological and 
syntactic annotation. This section reviews previous work on developing the three 
major treebanks for Arabic: the Penn, Prague and Columbia Arabic treebanks. 

2.3.1 The Penn Arabic Treebank 

The Penn English Treebank (Marcus, Santorini and Marcinkiewicz, 1993) was the 
first large-scale syntactic annotation project in any language, and helped introduce 
an alternative methodology for parser construction. Parsers that had previously 
been developed using hand-written grammatical rules were supplemented by 
parsers using statistical models induced from treebank data (Collins, 1999; 
Charniak, 2000; Nivre et al., 2007b). Over the last two decades, the Penn 
Treebank has remained one of the standard datasets for benchmarking English 
parsing, with state-of-the-art statistical parsers achieving F1-scores of 90-92% 
against Penn Treebank data. 

For Modern Arabic, the Penn Arabic Treebank (Maamouri et al., 2004) is a 
related project designed to support the development of data-driven morphological 
analyzers and syntactic parsers. This project is important as it is the first treebank 
for the Arabic language. It uses the same constituency representation as the 
English Treebank, with the same tags used to annotate phrase structure. Maamouri 
et al. (2004) argue that using the English tagset for Arabic makes it easier to train 
annotators and that existing linguistic tools for English can be reused, simplifying 
the annotation process. 

However, after the initial release of the treebank, several constituency parsers 
previously developed for English were adapted to Arabic. Compared to English, 
the Arabic Treebank has been found to be more challenging to parse, with parsers 
achieving lower F1-scores of 74-83%. Recent work has shown that the treebank’s 
choice of constituency representation has affected both parsing accuracy and 
annotation consistency (Kulick et al., 2006; Green and Manning, 2010). Section 
2.4 reviews this parsing work and describes the causes of underperformance. 

Figure 2.2 (overleaf) shows an example tree from the Penn Arabic Treebank 
annotated using constituency syntax. As per the annotation guidelines (Bies and 
Maamouri, 2003), this tree is shown in bracketed form and annotates a sentence 
that is a single Arabic verb stem with attached clitics. The word-form has been 
segmented into four morphemes shown both in Arabic script and Buckwalter 
Transliteration. In the parse tree, the tags are the same as those used for the Penn 
English Treebank (S = sentence, VP = verb phrase, PRT = particle, NP-SBJ = 
noun phrase / subject, NP-OBJ = noun phrase / object). The tree also contains an 
empty category denoted by an asterisk (*). In Arabic, the subjects of verbs are 
often dropped pronouns and are implied by the verb’s morphological inflection 
features. In comparison to the work in this thesis, the Penn Arabic Treebank is the 
only other Arabic resource to annotate elliptical structure. 

    (S wa-
       (VP (PRT -sa-)
           -tu+$Ahid+uwna-
           (NP-SBJ *)
           (NP-OBJ -hA)))

    ‘and you will observe her’

Figure 2.2: Constituency tree from the Penn Arabic Treebank. 
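
To make the bracketed notation concrete, the following is a minimal sketch of how such a string could be read into a nested structure. The function names are hypothetical, and the example string is transcribed from the Buckwalter-transliterated part of Figure 2.2 with the closing brackets restored as an assumption; it is not the treebank's own tooling.

```python
import re
from typing import List, Union

Tree = Union[str, list]  # a leaf string, or [label, child, child, ...]


def parse_bracketed(text: str) -> Tree:
    """Parse a Penn-style bracketed string into nested lists.
    Assumes well-formed input with balanced brackets."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # leaf: a morpheme or the empty category '*'
        node = [tokens[pos]]          # the first symbol after '(' is the label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                      # consume the closing ')'
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the leaves from left to right, including empty categories."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


example = "(S wa- (VP (PRT -sa-) -tu+$Ahid+uwna- (NP-SBJ *) (NP-OBJ -hA)))"
tree = parse_bracketed(example)
print(leaves(tree))
```

The leaf list keeps the asterisk as its own element, which mirrors how the empty category for the dropped subject pronoun occupies a position in the tree even though it has no surface form.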



The first version of the Penn Arabic Treebank was annotated over a three-year 
period using a two-stage process. The first stage is morphological annotation, 
where each sentence is processed using BAMA (described previously in section 
2.2.1), to produce a list of possible morphological segmentations with part-of- 
speech tags, lemmas and morphological features for each word. Following 
automatic tagging, morphological annotation is manually corrected by paid 
linguistic experts who select the most suitable analysis from the list of available 
possibilities. The second stage is syntactic annotation. Bikel’s parser is used to 
generate a constituency tree for each sentence using the reviewed morphological 
annotation (Bikel, 2004a). The constituency trees are then reviewed and corrected 
by annotators. Using this two-stage process, the initial release of the treebank 
contained morphosyntactic annotation for approximately half a million words of 
Arabic (Maamouri et al., 2004). 

For newer versions of the Penn Arabic Treebank, Maamouri et al. (2008) have 
suggested changes to the annotation scheme to improve parsing accuracy. They 
note that annotation inconsistencies in the Arabic treebank arise when expert 
annotators, who are familiar with traditional Arabic grammar and concepts from 
i’rab, attempt to interpret their analyses using an annotation scheme originally 
designed for English. They propose a revised set of guidelines that includes new 
tags to better represent the fine-grained distinctions of Arabic syntax. These 
changes align the tagset more closely to traditional concepts already familiar to 
annotators, such as the traditional categorization of nominals and particles. This 
compares to the work presented in this thesis, which uses a tagging scheme based 
on traditional grammar, but using an alternative hybrid syntactic representation. In 
contrast, the new guidelines for the Penn Arabic Treebank fall short of suggesting 
any changes to the syntactic representation, which remains constituency-based, 
despite the accuracy limitations this imposes on Arabic parsing. 



2.3.2 The Prague Arabic Treebank 

The syntactic representation to be presented in Chapter 6 is a dependency-based 
hybrid that includes aspects of constituency syntax. This compares to the second 
major Arabic treebank to be released after the Penn Treebank, the Prague Arabic 
Treebank (Hajic et al., 2004; Smrz and Hajic, 2006). This treebank uses a pure 
dependency representation and annotates the same source text as the Penn 
Treebank - collections of Arabic news articles distributed by the Linguistic Data 
Consortium (LDC). 

The Prague Arabic Treebank shares its grammatical framework with the Prague 
Czech Treebank (Hajic, Hladka and Pajas, 2001), and focuses on three levels of 
annotation: morphological, analytical (surface syntax) and tectogrammatical (deep 
syntax and linguistic meaning). The first version of the treebank, published in 
2004, contains morphological annotation for 148,000 words and syntactic 
annotation for 113,500 words, with tectogrammatical annotation still under 
development at the time of its publication. 

The grammatical framework used for the Prague Treebank is the Functional 
Generative Description (Sgall, Hajicova and Panevova, 1986; Hajicova and Sgall, 
2003). This is a dependency-based representation that emphasizes the difference 
between form (including word-forms and morphological realizations) and 
function (such as the syntactic roles of subject, object and predicate). This 
grammatical description was originally designed for Czech, a language that is 
morphologically rich and has a high degree of free word order. Both of these 
aspects of Czech are also found in Arabic. The authors of the treebank argue that 
using a dependency representation has resulted in annotations better suited to 
Arabic’s linguistic constructions, compared to the constituency representation 
used for the Penn Treebank. Smrz and Hajic (2006) note the similarities between 
their dependency representation and the Arabic linguistic tradition: 



Not only are the notions of dependency and function central to many 
modern linguistic theories and ‘inherent’ to computer science and logic, 
their connection to the study of the Arabic language and its meaning is 
interesting too, as the traditional literature on these topics, with some works 
dating back more than a thousand years, actually involved and developed 
similar concepts. 



Hajic et al. (2004) describe the annotation methodology used to develop the 
treebank as multi-staged. Initial morphological tagging was performed by a data- 
driven maximum entropy tagger that was previously developed for Czech (Hajic 
and Hladka, 1998). This tagger was adapted to Arabic through retraining by using 
morphological data from the Penn Arabic Treebank. They report a 10.8% error 
rate for tagging parts-of-speech, but only a 0.8% error rate for segmentation of 
Arabic words into constituent morphemes. 

Following automatic tagging, expert annotators corrected the morphological 
analysis and manually added syntactic annotation. Once an initial section of the 
treebank was completed, a syntactic parser was trained on the annotated data in 
order to automatically parse the remainder of the corpus. The resulting 
dependency trees were then manually corrected by annotators. 

Figure 2.3 (overleaf) shows an example tree from the Prague Arabic Treebank. 
Individual Arabic words have been morphologically segmented into morphemes, 
with one morpheme annotated per line. The first line is reserved for the abstract 
root of the dependency tree. This differs from other dependency treebanks, such 
as the Columbia Arabic Treebank, in which all nodes including the root node 
correspond to morphemes (Habash and Roth, 2009c). 

The diagram is organized into four columns. Reading from left-to-right, the first 
column contains the dependency tree. The tree’s nodes are morphemes and the 
tree’s edges are labelled with syntactic roles. The syntactic tags shown in the 
diagram are the same as those found in the Czech Treebank (AuxS = Root Node, 
AuxY = Adverbial Particle, AuxP = Preposition, Adv = Adverb, Atr = Attribute, 
Pred = Predicate, Sb = Subject, Obj = Object, Coord = Coordination, AuxK = 
Punctuation). This approach is similar to the Penn Arabic Treebank, which also 
does not use traditional Arabic grammar for its syntactic tags, but instead reuses 
an annotation scheme for another language. The second column shows surface 
forms, displayed using both Arabic script and a phonetic English transcription. 
The third column is a gloss for each morphological segment. Finally, the fourth 
column displays morphological tagging using positional notation. The positions 
are slots for major and minor parts of speech, mood, voice, person, gender, 
number, case and state features. Unset values are indicated by dashes (-). For 
example, the Arabic word for ‘the magazine’ is tagged as N FS1D, denoting a 
feminine singular noun in the nominative case with definite state. 

Since its initial release, the treebank has been extended with morphological 
annotation for 393,000 words, syntactic annotation for 125,000 words and 
tectogrammatical annotation for 10,000 words. Data from this extended version of 
the treebank was used in the CoNLL shared task on multilingual dependency 
parsing to benchmark the performance of several Arabic statistical parsers (Nivre 
et al., 2007a). This parsing work is reviewed in section 2.4.3. 

‘In the section on literature, the magazine presented the issue of the Arabic 
language and the dangers that threaten it.’ 

Figure 2.3: Dependency tree from the Prague Arabic Treebank. 

2.3.3 The Columbia Arabic Treebank 

The Columbia Arabic Treebank (CATiB) is the third major syntactic treebank for 
Arabic (Habash, Faraj and Roth, 2009a; Habash and Roth, 2009c). The treebank is 
designed to facilitate the development of statistical parsers for Modern Arabic. 
Like the Prague Treebank, the Columbia Treebank is also annotated using 
dependency grammar. However, the Columbia Treebank contrasts with both the 
Penn and Prague treebanks by adopting a minimalistic syntactic representation. 
The methodology for treebank construction focuses on rapid annotation using a 
smaller number of tags, allowing annotators to correct large amounts of text as 
quickly as possible. The treebank’s tagset has six part-of-speech tags, shown in 
Table 2.2 below: 



Part-of-Speech Tag    Meaning
NOM                   Nominals (nouns, pronouns, adjectives and adverbs)
PROP                  Proper nouns
VRB                   Verbs
VRB-PASS              Passive-voice verbs
PRT                   Particles (including prepositions and conjunctions)
PNX                   Punctuation



Table 2.2: Part-of-speech tags in the Columbia Arabic Treebank. 



Similarly, the dependency tagset is also minimal with only seven tags (Table 
2.3, overleaf). With the exception of the modifier tag (MOD), the dependency 
relations are based on well-known traditional syntactic roles. These tags are easily 
understandable by expert annotators familiar with traditional Arabic grammar. 
The annotation scheme purposely excludes additional relations used for deep 
tagging, such as the functional tags for time and place in the Penn Treebank. 



Dependency Tag    Meaning
SBJ               Subject
OBJ               Object
TPC               Topic
PRD               Predicate
IDF               Possessive (idafa)
TMZ               Specification (tamyiz)
MOD               Modifier



Table 2.3: Dependency tags in the Columbia Arabic Treebank. 
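
Because both tagsets are so small, an annotated tree can be checked against them mechanically. The sketch below is illustrative only: the `Token` class, the `validate` function and the `ROOT` placeholder label for the root token are assumptions made for this example, and the three-token tree uses English glosses rather than treebank data.

```python
from dataclasses import dataclass
from typing import List

# Tag inventories copied from Tables 2.2 and 2.3.
POS_TAGS = {"NOM", "PROP", "VRB", "VRB-PASS", "PRT", "PNX"}
DEP_TAGS = {"SBJ", "OBJ", "TPC", "PRD", "IDF", "TMZ", "MOD"}
ROOT_REL = "ROOT"  # assumption: placeholder relation for the root token


@dataclass
class Token:
    form: str   # surface form or gloss of the morphological segment
    pos: str    # expected to be one of POS_TAGS
    head: int   # 1-based index of the head token, 0 for the root
    rel: str    # one of DEP_TAGS, or ROOT_REL for the root


def validate(tokens: List[Token]) -> List[str]:
    """Return a list of problems; an empty list means the tree passes.
    Cycles are not detected in this sketch."""
    problems = []
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for i, tok in enumerate(tokens, start=1):
        if tok.pos not in POS_TAGS:
            problems.append(f"token {i}: unknown part-of-speech tag {tok.pos!r}")
        if not 0 <= tok.head <= len(tokens) or tok.head == i:
            problems.append(f"token {i}: invalid head index {tok.head}")
        expected = {ROOT_REL} if tok.head == 0 else DEP_TAGS
        if tok.rel not in expected:
            problems.append(f"token {i}: unexpected relation {tok.rel!r}")
    return problems


# Illustrative three-token tree using English glosses; not treebank data.
sentence = [
    Token("visited", "VRB", 0, ROOT_REL),
    Token("tourists", "NOM", 1, "SBJ"),
    Token("Lebanon", "PROP", 1, "OBJ"),
]
print(validate(sentence))   # an empty list means no problems were found
```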



Habash et al. (2009a) emphasize that basing their scheme on concepts from the 
Arabic linguistic tradition simplifies the annotation process. This compares to the 
approach used for the Quranic Arabic Corpus, which also uses a tagset based on 
traditional grammar, but utilizes a more fine-grained set of tags: 



CATiB uses a linguistic representation and terminology inspired by 
Arabic’s long tradition of syntactic studies. This makes it easier to train 
annotators without being restricted to hire annotators who have degrees in 
linguistics. CATiB uses an intuitive dependency representation and 
relational labels inspired by Arabic grammar such as tamyiz (specification) 
and idafa (possessive construction) in addition to universal predicate- 
argument structure labels such as subject, object and modifier. 



The initial version of the treebank provided morphological and syntactic 
annotation for 200,000 words of Arabic, annotated rapidly over five months. The 
annotator training period was only two months, compared to between six months 
to a year for the Penn and Prague Arabic treebanks (Habash and Roth, 2009c). 

‘50 thousand tourists visited Lebanon last September.’ 

Figure 2.4: Constituency tree from the Penn Arabic Treebank (upper tree) 
and a dependency tree from the Columbia Arabic Treebank (lower tree). 

As with previous treebanks, the annotation methodology proceeds in multiple 
stages. In the first stage, the text is part-of-speech tagged and morphologically 
segmented using the MADA+TOKAN toolkit (Habash and Rambow, 2005). The 
F1 accuracy scores reported for these two morphological processing tasks are 
99.7% and 97.7% respectively. The automatically tagged data is corrected by 
annotators. Following morphological annotation, initial dependency parsing was 
performed using MaltParser (Nivre et al., 2007) and then manually reviewed. The 
parser was trained using data from the Penn Arabic Treebank by automatically 
converting constituency trees into dependency trees. Following completion of the 
first section of the treebank, the parser’s statistical model was improved by 
retraining using the additional annotated data. 

To illustrate the differences in representation between the Penn Treebank and 
the Columbia Treebank, Figure 2.4 shows the same Arabic sentence 
annotated using both schemes. The upper tree in the diagram uses Penn Treebank- 
style constituency annotation. The lower tree is a dependency tree from the 
Columbia Treebank. Similar to the Prague Treebank, this tree has nodes which are 
morphological segments and edges labelled with syntactic dependency roles. 

Work on developing the Columbia Arabic Treebank demonstrates that 
high-quality morphosyntactic annotation of Arabic is possible using an annotation 
scheme based on concepts from traditional Arabic grammar. Compared to the 
Penn Arabic Treebank, Habash et al. (2009c) report higher inter-annotator 
agreement for morphological and syntactic annotation, as the tagset is based on 
concepts familiar to annotators. However, due to the focus on rapid annotation, 
the treebank lacks fine-grained morphological or syntactic annotation. This differs 
from the work for Classical Arabic presented in this thesis. For example, although 
ellipsis is commonly used to describe syntactic structure in traditional grammar, 
the Columbia treebank does not annotate empty categories. In contrast, the 
Quranic Arabic Corpus provides a fine-grained morphological representation with 
a richer tagset, as well as being more closely aligned to traditional concepts. 

2.4 Statistical Parsing Models 

2.4.1 Classical Arabic Parsing 

Despite lower accuracy scores compared to English, Modern Arabic parsing is 
well established in computational linguistics research. State-of-the-art Modern 
Arabic parsers utilize data-driven statistical models and have been evaluated on 
large datasets, for both constituency and dependency representations. In contrast, 
almost no previous work has been published for parsing Classical Arabic. The few 
published studies are either descriptions of small experiments, or are discussion 
papers that outline possible approaches without providing clear descriptions of 
methodology or results. For example, Shokrollahi-Far et al. (2009) discuss their 
rule-based constituency parser. Although they outline a parsing experiment using 
verses of the Quran, they fail to explain their evaluation process in detail and do 
not report accuracy scores. Similarly, Shatnawi and Belkhouche (2012) describe a 
small experiment for parsing the Quran using a recursive descent parser. They 
generate constituency trees for a small 60-word sample of the Quran using hand- 
written grammatical rules but do not evaluate parsing performance. 

Previous work for Classical Arabic parsing has been limited by lack of data. 
Unlike for Modern Arabic, treebanks for Classical Arabic have not previously 
been developed, ruling out data-driven approaches to parsing using statistical 
methods. In contrast, the statistical parser described in this thesis is made possible 
by learning from a new manually-verified treebank. 



2.4.2 Arabic Constituency Parsing 

For Modern Arabic, using constituency phrase-structure to represent Arabic 
syntax has resulted in parsing underperformance. For example, Kulick et al. 
(2006) parse the Penn Arabic Treebank using Bikel’s parser (Bikel, 2004b). This 
is an improved reimplementation of Collins’ parser, a well-known model for 
constituency syntax (Collins, 1999). They report an F1-score of 74% for Arabic, 
but a much higher score of 88% for a similarly sized English dataset. This suggests 
that parsing using a constituency representation is more suitable for English than 
for languages with relatively free word order such as Arabic. 

In a more recent comparison, Green and Manning (2010) measure the accuracy 
of three constituency parsers, including their own Stanford parser, against the 
Penn Arabic Treebank. Their results are not directly comparable to Kulick et al. 
since they use an alternative metric for measuring accuracy. Instead of Parseval, 
they use a leaf-ancestor metric, and report scores of 77.5% for Bikel’s parser, 80% 
for the Stanford Parser and 83.1% for the Berkeley parser (Petrov, 2009). 
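
For reference, the Parseval-style F1-scores quoted in this chapter combine the precision and recall of labelled brackets. The sketch below shows the basic calculation under simplifying assumptions: the function name and the example spans are hypothetical, and the conventions of standard evaluation tools (such as which labels or punctuation to ignore) are not modelled. The leaf-ancestor metric is a different measure and is not shown.

```python
from collections import Counter
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (phrase label, start, end) over token positions


def parseval(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Labelled bracket precision, recall and F1 for one or more trees."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Multiset intersection: a bracket matches at most as often as it occurs in both.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical brackets for a four-token sentence.
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
predicted = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 3, 4)]

p, r, f1 = parseval(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this example three of the four predicted brackets match the gold brackets, so precision, recall and F1 are all 0.75.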

These results fall short of state-of-the-art parsing performance for English. In 
addition to measuring accuracy, they investigate the causes of poor parsing results 
for the Penn Arabic Treebank. They conclude that low annotation consistency is a 
problem. They also note that using a constituency representation for Arabic does 
not capture important syntactic constructions not found in English: 



It is well-known that constituency parsing models designed for English 
often do not generalize easily to other languages and treebanks. The Penn 
Arabic Treebank (ATB) syntactic guidelines (Maamouri et al., 2004) were 
purposefully borrowed without major modification from English (Marcus et 
al., 1993). Further, Maamouri and Bies (2004) argued that the English 
guidelines generalize well to other languages. But Arabic contains a variety 
of linguistic phenomena unseen in English. The ATB is similar to other 
treebanks in gross statistical terms, but annotation consistency remains low 
relative to English. Our results suggest that current parsing models would 
benefit from better annotation consistency and enriched annotation in 
certain syntactic configurations. 



However, Green and Manning are able to improve parsing performance by 
supplementing the Penn Arabic Treebank with additional morphosyntactic 
features. Using this approach, they are able to boost the accuracy of a probabilistic 
context-free parser from 75.95% to 80.95%, measured using the leaf-ancestor 
metric. The additional features they add to the treebank are designed to capture 
linguistic constructions that only occur in Arabic and not English, and are partly 
based on linguistic considerations from traditional grammar: 



For verbs we add two features. First we mark any node that dominates a 
verb phrase. This feature has a linguistic justification. Historically, Arabic 
grammar has identified two sentences types: those that begin with a nominal 
(الجملة الاسمية), and those that begin with a verb (الجملة الفعلية). But foreign 
learners are often surprised by the verbless predications that are frequently 
used in Arabic. Although these are technically nominal, they have become 
known as ‘equational’ sentences. [This feature] is especially effective for 
distinguishing root S nodes of equational sentences. We also mark all nodes 
that dominate an SVO (subject-verb-object) configuration. In MSA, SVO 
usually appears in non-matrix clauses. 



This thesis will address the limitations that the Penn Treebank’s constituency 
representation has on Arabic parsing performance. For example, the annotation 
improvements suggested by Green and Manning are implemented in the Quranic 
Arabic Corpus. The suggested tags for nominal phrases² (الجملة الاسمية) and verbal 
phrases (الجملة الفعلية) are explicitly annotated, as these are among the structures 
described by traditional Arabic grammar in Chapter 6. 

² In Arabic grammar, the concept الجملة applies to clauses as well as phrases. The term 
‘nominal phrase’ is used here generally, to refer to nominal syntactic structures. 



2.4.3 Arabic Dependency Parsing 

Most recent parsing work for Arabic has focused on dependency grammar, a 
representation better suited to modelling languages with free word order such as 
Arabic. The 2007 Conference on Computational Natural Language Learning 
(CoNLL) featured a shared task that evaluated statistical dependency parsers for 
several languages (Nivre et al., 2007a). State-of-the-art parsers for Modern Arabic 
were tested in the shared task using data from the Prague Arabic Treebank 
developed by Hajic et al. (2004). As input, the parsers were provided with Arabic 
text with gold-standard morphological annotation, including part-of-speech tags, 
segmentation and features annotated from the treebank. The same approach is 
used in this thesis, where gold-standard morphological annotation is also assumed 
as input for evaluating a new Classical Arabic parser. 



Lead Author    Parsing Model                                 Score
Nilsson        Ensemble (combination of six models)          76.52
Nakagawa       Global graph features using Gibbs sampling    75.08
Hall           MaltParser                                    74.75
Sagae          Ensemble (combination of three models)        74.71
Chen           Unlabelled MaltParser + SVM labelling         74.65



Table 2.4: Top five statistical parsers for Arabic in the CoNLL shared task. 



A total of 20 Arabic dependency parsers were evaluated in the shared task. 
Table 2.4 summarizes the results of the top five parsers, measured using a labelled 
attachment score (LAS) metric. The best performing parser by Nilsson, described 
in Hall et al. (2007a), uses an ensemble system that combines the results of six 
parsing models using MaltParser (Nivre et al., 2007b). However, the top score of 
76.52% falls short of the performance of 88.1% reported for English dependency 
parsing in the same task. This work demonstrates that parsing the Prague Arabic 
Treebank is more challenging than English dependency parsing. 
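
The labelled attachment score used in Table 2.4 is the proportion of tokens that receive both the correct head and the correct dependency label. A minimal sketch of the calculation is given below; the function name and the four-token example are hypothetical (the labels are borrowed from the Prague tagset listed in section 2.3.2), and details of the shared task's official scorer, such as its treatment of punctuation, are not modelled.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS): the share of tokens with the correct head,
    and the share with both the correct head and the correct label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las


# Hypothetical four-token sentence; head 0 marks the root.
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
predicted = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Here the third token has the right head but the wrong label, and the fourth has the wrong head, so the unlabelled score is 0.75 while the labelled score is 0.50.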

These results contrast with recent work by Marton et al. (2013), who report 
improved parsing results for the Columbia Arabic Treebank. Like Hall et al., they 
also use MaltParser, and report a baseline F1-score of 81% for their Arabic 
dependency parsing model. They are able to increase parsing accuracy to 84% by 
introducing a more fine-grained tagset with additional morphological features not 
included in the Columbia Treebank’s original annotation scheme. They conclude 
that the most useful features for dependency parsing that are missing from the 
treebank are definiteness, person, number, gender and lemma. The Quranic Arabic 
Corpus addresses this limitation by including these additional features as part of 
its fine-grained annotation scheme. 



2.4.4 Dual Dependency-Constituency Parsing 

Within published literature, previous work that most closely resembles the hybrid 
dependency-constituency parsing algorithm developed in this thesis is the 
approach by Hall et al. for German (Hall and Nivre, 2008) and for Swedish (Hall, 
Nivre and Nilsson, 2007b). However, in contrast to the hybrid parser presented in 
Chapter 9, their combined model outputs two parse trees for an input sentence, 
providing distinct annotation for dependency and constituency representations. 
They also describe their approach as hybrid parsing. To avoid confusion, this 
thesis instead uses the term ‘dual parsing’ for their model. The term ‘hybrid 
parsing’ is reserved for the new algorithms presented in Chapter 9, which output a 
single graph using a hybrid dependency-constituency representation. 

The dual parsing algorithm described by Hall et al. extends MaltParser to 
output constituency trees by merging the two representations into dependency 
structures. The merged structures encode additional constituency information on 
enriched edge labels. The two diagrams overleaf illustrate the merging process for 
Swedish (Hall et al., 2007b). Figure 2.5 shows a constituency representation with 
an equivalent dependency representation. In Figure 2.6, the lower tree is a 
dependency structure with merged edges. Merging is possible if for every word w 
in a sentence, the sequence of words governed by w in the dependency tree is 
equal to the set of leaf nodes covered by a non-terminal node n in the constituency 
tree. In the merged representation, compound edge labels are of the form X|Y,
where X is w’s dependency relation, and Y is n’s phrase-structure tag if n is not a
preterminal, or an asterisk (*) otherwise.

Figure 2.5: Constituency and dependency representations for Swedish.

Figure 2.6: Dual dependency-constituency representation for Swedish.

Hall et al. build their statistical model for dual parsing by training MaltParser
using data converted to the merged representation. To produce constituency trees,
the merged output is post-processed after dependency parsing. An inverse
transformation is applied that uses the information encoded on merged edges to
restore constituency nodes and phrase-structure tags. For German, Hall and Nivre
(2008) measure performance using constituency data from two German treebanks:
the TIGER Treebank (Brants and Hansen, 2002) and the TüBa-D/Z Treebank
(Hinrichs et al., 2004). Using head-finding rules, dependency data is collected by 
automatically converting from the constituency representation in the treebanks. 
They report accuracy close to 90% for dependency parsing, measured using a 
labelled attachment score. Similarly, for Swedish, Hall et al. (2007b) report results 
of over 80% using the same metric. 
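
The merging condition and its inverse transformation, as described above, can be sketched in a few lines. This is a simplified illustration and not MaltParser's API or the implementation of Hall et al. All function names and the English toy sentence are hypothetical. A '|' separator is assumed for compound labels, and constituents are stored as a mapping from the set of covered word positions to a phrase tag, so two non-terminals covering exactly the same words cannot be represented.

```python
from typing import Dict, FrozenSet

Span = FrozenSet[int]  # set of word positions covered by a node


def governed(word: int, heads: Dict[int, int]) -> Span:
    """Positions of `word` together with everything it transitively governs."""
    span = {word}
    changed = True
    while changed:
        changed = False
        for dependent, head in heads.items():
            if head in span and dependent not in span:
                span.add(dependent)
                changed = True
    return frozenset(span)


def merge(heads: Dict[int, int], deprels: Dict[int, str],
          phrases: Dict[Span, str]) -> Dict[int, str]:
    """Encode phrase tags on edge labels of the form X|Y."""
    merged = {}
    for word in heads:
        span = governed(word, heads)
        if span in phrases:
            tag = phrases[span]      # a non-terminal covers exactly this span
        elif len(span) == 1:
            tag = "*"                # only the preterminal covers the word
        else:
            raise ValueError(f"no constituent matches the words governed by {word}")
        merged[word] = f"{deprels[word]}|{tag}"
    return merged


def restore(heads: Dict[int, int], merged: Dict[int, str]) -> Dict[Span, str]:
    """Inverse transformation: rebuild the constituents from merged labels."""
    phrases = {}
    for word, label in merged.items():
        _, tag = label.rsplit("|", 1)
        if tag != "*":
            phrases[governed(word, heads)] = tag
    return phrases


# Hypothetical toy sentence: 1 = "the", 2 = "man", 3 = "sleeps"; head 0 is the root.
heads = {1: 2, 2: 3, 3: 0}
deprels = {1: "DET", 2: "SUBJ", 3: "ROOT"}
phrases = {frozenset({1, 2}): "NP", frozenset({1, 2, 3}): "S"}

merged = merge(heads, deprels, phrases)
restored = restore(heads, merged)
```

Because the phrase tags travel on the edge labels, a standard dependency parser can be trained on the merged data without modification, and the constituency tree is recovered only in post-processing.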

Dual parsing algorithms are relevant to the work in this thesis, which compares 
a hybrid parser to a multi-step dependency model that uses post-processing. A 
similar approach to Hall et al. will be used to encode constituency information 
onto merged edge labels for multi-step hybrid parsing. However, this approach 
will be adapted to the Classical Arabic syntactic annotation scheme. 



2.4.5 Parsing Models for Ellipsis 

To the best of the author’s knowledge, the work in this thesis describes the first 
dependency-based parsing model in any language for elliptical constructions. In 
syntactic treebanks, empty categories are used to represent words or phrases that 
are not written or pronounced in the original text, such as the elliptical annotation 
in the Penn Treebank for null complementizers and wh-movement. Figure 2.7 
overleaf shows an example from the Penn Treebank for the noun phrase ‘the man 
Sam likes’. This constituency tree annotates two empty categories. The node 
marked 0 is a null complementizer, i.e. ‘the man (that) Sam likes’. The second 
node marked *T*-1 is a co-indexed trace. 

Although no previous work exists for dependency parsing with ellipsis, related 
work has been done for constituency parsing. Gabbard et al. (2006) show that it is
possible to fully recover Penn Treebank-style trees for English including function
tags and empty categories, by training a cascade of statistical classifiers.



(NP (NP (DT the) (NN man))
    (SBAR (WHNP-1 (-NONE- 0))
          (S (NP (NNP Sam))
             (VP (VBZ likes)
                 (NP (-NONE- *T*-1))))))



Figure 2.7: Empty categories in a Penn English Treebank constituency tree. 



For Arabic constituency representations, Gabbard (2010) extends this approach 
to recover the empty categories annotated in the Penn Arabic Treebank. In his 
description of ellipsis restoration, Gabbard notes that both functional tags and 
elliptical structures are not generally considered in constituency parsing work: 

The syntactic structures produced by the most commonly used parsers are 
less detailed than those structures found in the treebanks the parsers were 
trained on. In particular, this is true of Collins (1999), Bikel (2004) and 
Charniak (2000), which are very commonly used. The parsers do not
recover two sorts of information present in all the Penn Treebanks (English,
Arabic, Chinese, and historical). The first are annotations on constituents
indicating their syntactic or semantic function in the sentence (Gabbard et
al., 2006). The second kind of information is tree nodes which do not
correspond to overt (written or pronounced) words.

For dependency representations, although various treebanks annotate elliptical
structures, these have previously been ignored in parsing work. For example, 
Rello and Ilisei (2009) develop a Spanish corpus annotated with dropped subject 
pronouns using dependency grammar. This compares to Classical Arabic, where 
dropped subject pronouns also frequently occur. However, they use manual 
annotation for this task, as no dependency or constituency parsers for Spanish 
exist for these constructions. In related work, Bengoetxea and Gojenola (2010) 
use MaltParser to parse the Basque Dependency Treebank, which originally 
included empty categories to represent ellipsis and coordination. However, their 
work uses a newer version of the treebank in which the empty categories are no 
longer annotated in order to minimize the number of non-projective edges in the 
treebank and simplify parsing. 

Similarly, previous Arabic dependency treebanks do not annotate ellipsis, a 
limitation addressed in this thesis. In contrast to the post-processing approach
described by Gabbard et al., the dependency-based parser that will be presented 
for Classical Arabic handles ellipsis in the hybrid representation directly in the 
parsing process. 



2.4.6 Hebrew Parsing Models 

Hebrew, another Semitic language, poses a similar set of parsing challenges to
Arabic. Both languages have relatively free word order and require
morphological disambiguation for syntactic parsing. Similar to recent work for
Arabic, parsing work for Hebrew focuses on both constituency and dependency
representations. For dependency parsing, Goldberg and Elhadad (2010) apply a
pipeline approach by disambiguating morphology and syntax in two separate 
steps. They report an 84.2% labelled attachment score using gold-standard 
morphological input, and 76.2% using predicted morphological tagging. 

More recent work for Hebrew parsing has focused on joint morphological and 
syntactic models. In contrast to pipeline approaches, in which the output of a 
morphological analyzer is given to a syntactic parser, this approach utilizes an
integrated statistical model. Tsarfaty (2006) argues that for Semitic languages
such as Arabic and Hebrew, morphological disambiguation is dependent on
syntactic context, and that combined models lead to improved performance. This
is demonstrated by Goldberg and Elhadad (2011), who perform joint parsing
using a lattice segmentation model for Hebrew. Using the Berkeley parser (Petrov,
2009), they report an F1-score of 77.3% using a pipeline approach, and 79.9% for
joint disambiguation. 

Similar to Goldberg and Elhadad’s evaluation methodology, the Classical 
Arabic parser developed in this thesis will be evaluated by considering a pipeline 
approach as a baseline, in which the output of a dependency parser is converted to 
the hybrid representation. This will be compared to a one-step dependency- 
constituency parser that uses a joint model for the hybrid representation. 
However, joint morphological disambiguation for Classical Arabic is beyond the 
scope of this thesis. Although recent work for Hebrew suggests that joint models 
outperform pipeline approaches, joint morphological disambiguation has not yet 
been performed for Arabic, and Arabic statistical parsers are generally evaluated 
using gold-standard morphological input. 

2.5 Annotation Methodologies 

This section reviews previous work for three annotation methodologies: offline 
expert annotation, online crowdsourcing, and supervised collaboration - the 
methodology used to annotate the Quranic Arabic Corpus. 

Most annotated corpora are developed by experts who annotate a corpus 
manually, following an annotation scheme and a set of annotation guidelines. 
Crowdsourcing is an emerging alternative methodology in which a large number 
of non-experts repeatedly annotate a corpus. These independent annotations are 
combined to achieve high reliability, using an aggregate metric such as majority 
voting or statistical weighting. These methodologies contrast with recent work for 
supervised collaboration, a third approach to annotation where non-experts 
produce annotations collaboratively under expert supervision.

2.5.1 Expert Annotation

Inter-annotator agreement for corpora annotated by experts is important for
consistent and high-quality annotation. However, agreement between annotators 
can be difficult to achieve, requiring training, clear guidelines, and reconciling 
different annotator results to produce the final gold-standard annotation. Kilgarriff 
(1998) investigates the factors that affect inter-annotator agreement for
word-sense tagging. He notes that two important reasons for inconsistent results
between experts are a poorly-defined annotation scheme and mistakes by 
annotators due to lack of motivation or misunderstanding the annotation task. 

For syntactic annotation, Brants (2000c) analyzes the annotation accuracy of 
the German NEGRA Treebank. Initial annotation of the treebank was performed 
quickly by two experts who manually corrected the output of a syntactic parser 
(Skut et al., 1997; Brants, Skut and Uszkoreit, 1999). Brants reports an initial 
annotation speed of 50 seconds per sentence for each annotator on average. In 
contrast, total annotation time was measured at 10 minutes per sentence for the 
final gold-standard. This included the time spent by two annotators independently 
reviewing each sentence, performing a comparison of each other’s work, and 
discussing and correcting differences. Initial inter-annotator agreement before 
discussion was 98.57%. Agreement between the initial versions and the final 
gold-standard was 98.8%. This work shows that despite comparison and review, 
disagreement between experts leads to an upper bound on annotation accuracy 
when measured using inter-annotator agreement.

Even widely used resources such as the Penn English Treebank have limits on
data quality. Marcus et al. (1993) report an inter-annotator agreement of 97% for
the part-of-speech tagging in the treebank. However, Manning (2011) analyses the 
quality of annotation by training a part-of-speech tagger and classifies its errors 
against a sample of sentences from the Penn Treebank (Table 2.5, overleaf).

Class                        Frequency
Lexicon gap                       4.5%
Unknown word                      4.5%
Could plausibly get right        16.0%
Difficult linguistics            19.5%
Underspecified/unclear           12.0%
Inconsistent/no standard         28.0%
Gold standard wrong              15.5%



Table 2.5: Errors for automatic part-of-speech tagging for the Penn Treebank. 



Manning classifies 12% of errors from the output of the tagger as due to 
underspecified or unclear part-of-speech tags. These errors resulted from tags 
being ambiguous or unclear to annotators, such as whether to choose a verbal or 
noun tag for gerunds. A further 28% of errors are attributed to inconsistent 
guidelines. Similar to Kilgarriff’s work on inter-annotator agreement for
word-sense tagging, this work shows that annotation guidelines need to be clear and
easily understandable even to expert annotators. 

2.5.2 Crowdsourcing, Voting and Averaging 

In contrast to expert annotation, crowdsourcing is an alternative approach that has 
proven to be effective for a wide variety of tagging tasks, with accuracy 
approaching that of expert annotation. Crowdsourcing is attractive because it is 
cost effective, allowing for large-scale annotation tasks that would otherwise be 
prohibitively expensive.

Nowak and Rüger (2010) investigate the effectiveness of crowdsourcing for
annotating Flickr photos with concept tags. Using 11 expert annotators, they
report an inter-annotator agreement of over 90%. Expert annotation was compared
to the results of using Amazon Mechanical Turk, an online crowdsourcing
marketplace. Using an averaging method based on majority voting, inter-annotator
agreement was found to be comparable to expert annotation. Although
these results indicate that crowdsourcing is viable, Nowak and Rüger suggest
further analysis by annotating larger datasets. 

A wider variety of linguistic annotation tasks are considered by Snow et al. 
(2008). Amazon Mechanical Turk is used for five tagging tasks: affect recognition 
(100 sentences), word similarity (30 word pairs), recognizing textual entailment 
(800 sentence pairs), event temporal ordering (462 verb event pairs) and word 
sense disambiguation (177 sentences). They note that Amazon Mechanical Turk is 
cost effective. For example, they paid only USD $2 to collect 7,000 non-expert 
annotations for the affect recognition task. 

To boost annotation accuracy, a statistical model is used to correct for the 
reliability and biases of individual annotators. Using a multinomial model similar 
to naive Bayes, results are combined by assigning annotators who are more than 
50% accurate positive votes, annotators whose judgments are pure noise zero 
votes and anti-correlated annotators negative votes. This statistical weighting
increases the accuracy of the annotation tasks by up to 4%, compared to majority 
voting. Snow et al. report that for most annotation tasks, only a small number of 
non-experts are required to achieve accurate annotation. For example, for the 
affect recognition task, the combined results of just four non-experts are required 
to emulate the quality of expert-level annotation. 
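
The difference between majority voting and reliability-weighted voting can be illustrated with a minimal sketch for a binary labelling task. It is a simplification and not the model of Snow et al.: each annotator is assumed to have a single known accuracy strictly between 0 and 1, and votes are weighted by the log-odds of that accuracy. This weighting gives positive votes to annotators who are more than 50% accurate, zero votes to pure noise and negative votes to anti-correlated annotators. The function names, annotator names and accuracies are hypothetical.

```python
import math
from typing import Dict


def majority_vote(votes: Dict[str, bool]) -> bool:
    """Label chosen by more than half of the annotators."""
    return 2 * sum(votes.values()) > len(votes)


def vote_weight(accuracy: float) -> float:
    """Log-odds weight: positive above 0.5, zero at 0.5, negative below 0.5.

    Assumes 0 < accuracy < 1.
    """
    return math.log(accuracy / (1.0 - accuracy))


def weighted_vote(votes: Dict[str, bool], accuracy: Dict[str, float]) -> bool:
    """Sum signed votes, each scaled by the reliability of its annotator."""
    score = sum(vote_weight(accuracy[name]) * (1 if vote else -1)
                for name, vote in votes.items())
    return score > 0


# Hypothetical item: one reliable annotator disagrees with two weaker ones.
votes = {"a": True, "b": False, "c": False}
accuracy = {"a": 0.95, "b": 0.60, "c": 0.60}

by_majority = majority_vote(votes)
by_weight = weighted_vote(votes, accuracy)
```

In this toy case the two methods disagree: majority voting follows the two weaker annotators, while the weighted vote follows the single reliable one. In practice the accuracies would themselves have to be estimated, for example from a small set of gold-standard items.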

In contrast to the small-scale experiments described above, an example of a 
large-scale corpus developed through crowdsourcing is the Phrase Detectives 
corpus, containing 1.1 million words annotated with 380,000 anaphoric relations 
(Chamberlain et al., 2009). In the description of their annotation methodology, 
they note that crowdsourcing is an attractive alternative to expert annotation:

The statistical revolution in natural language processing (NLP) has resulted
in the first NLP systems and components really usable on a large scale, from
part-of-speech (POS) taggers to parsers (Jurafsky and Martin, 2008). But it
has also raised the problem of creating the large amounts of annotated 
linguistic data needed for training and evaluating such systems. This 
requires trained annotators, which is prohibitively expensive both 
financially and in terms of person-hours (given the number of trained 
annotators available) on the scale required. 



Their solution is to motivate annotators through entertainment, by casting the 
annotation task as an online game. Phrase Detectives provides an interactive web- 
based interface for non-experts to learn how to annotate text and make annotation 
decisions. Following a training phase, the game runs in two modes. In annotation 
mode, players locate the closest markable antecedent of an anaphor. In validation 
mode, players are asked to review previously annotated sentences. Final 
annotations are selected through majority voting. The effectiveness of this 
methodology is measured by annotating a section of the corpus using two expert 
annotators. Inter-annotator agreement between the experts was 94%, compared to
93% between experts and non-experts. This demonstrates that large-scale 
annotation tasks can be highly reliable using crowdsourcing. 

2.5.3 Supervised Collaboration 

Supervised collaboration is an annotation methodology involving the online 
collaboration of multiple annotators whose work is reviewed by supervisors acting 
as editors. This methodology can be considered to be a middle ground between 
offline expert annotation and crowdsourcing. Supervised collaboration is also cost 
effective, but ensures reliability through expert supervision. 

Perhaps the best example of a fully collaborative resource is Wikipedia, 
constructed entirely by unpaid volunteers who are motivated by the interest they 
share in the articles being developed. Recent research has consistently shown that
the effectiveness of Wikipedia depends on incremental edits to improve quality,
but also crucially on open communication and discussion between editors to 
resolve issues, and to promote common understanding (Kittur and Kraut, 2010). 

Collaborative annotation with inter-annotator discussion has recently been used 
to develop specialist corpora that require the participation of expert annotators. 
For example, the Ancient Greek Dependency Treebank (Bamman et al., 2009) is 
developed by annotators with backgrounds ranging from advanced undergraduate 
students to recent PhD graduates and professors. The treebank provides syntactic 
annotation for 200,000 words of Ancient Greek texts, including the works of 
Hesiod, Homer and Aeschylus. It is unlikely that annotating the treebank could be 
performed effectively using a crowdsourcing marketplace such as Amazon 
Mechanical Turk, given the prerequisite knowledge required. Instead, the treebank 
was annotated using supervised annotation, with different groups of annotators 
developing different sections of the treebank. Every sentence was annotated by 
two annotators and the differences were reconciled by an expert with specialist 
knowledge of the text. 

In addition to an initial training period, annotators are actively engaged in new 
learning and collaboration by means of an online forum in which they can ask 
questions of each other and of project supervisors. Using this method, average 
annotator agreement for dependency relations was 80.6% compared to the final 
gold standard, measured using a labelled attachment score metric. 

The complexity of syntactically tagging Ancient Greek is demonstrated by the 
time and effort required to produce annotations. Average annotation speed was 
measured at only 124 words per hour. This compares to the Penn English 
Treebank, where annotator speed has been reported as 1,000 words per hour after 
four months training (Taylor, Marcus and Santorini, 2003). Bamman et al. argue 
that a collaborative methodology is more suitable for the creation of a scholarly 
treebank, given the specialist nature of the annotations. Supervised collaboration 
allows annotators with different levels of expertise to participate in the annotation 
process, while ensuring that annotations remain consistent and of a high quality.

2.6 Conclusion

This chapter reviewed previous work in four areas: morphological representations, 
syntactic representations, parsing and annotation methodologies. This section 
summarizes the implications of the reviewed work in relation to the thesis 
research questions. 

For annotation methodologies, the review contrasted the approaches of expert 
annotation, crowdsourcing and supervised collaboration. In comparison to expert 
annotation, crowdsourcing was found to be cost effective for a wide variety of
annotation tasks, producing annotation of comparable accuracy (Snow et al.,
2008; Chamberlain et al., 2009). Supervised collaboration is an alternative
approach that is also cost effective but is better suited to tasks requiring expert 
supervision, such as syntactic annotation of the Ancient Greek Treebank 
(Bamman et al., 2009). This compares to the Quranic Arabic Corpus, where 
annotation also requires specialist knowledge. The implication of this work is that 
supervised collaboration may be an appropriate methodology for annotating 
Classical Arabic, a research question that will be addressed in Chapter 7. 

From the literature on Arabic syntactic representations, a key theme is that 
although both representations are used, dependency representations are preferred 
to constituency representations, as Arabic is a language with free word order. The 
Penn Arabic Treebank (Maamouri et al., 2004) is the only treebank that uses a
constituency representation. In contrast, the Prague (Smrz and Hajic, 2006) and 
Columbia (Habash et al., 2009c) treebanks are dependency based, although only 
the Penn Treebank performs fine-grained syntactic annotation of constructions 
such as ellipsis. The work reviewed for Arabic parsing (Kulick et al., 2006; Green 
and Manning, 2010) implies that constituency representations impose limitations 
on annotation consistency and parsing accuracy. However, both types of 
representation have resulted in lower performance for Modern Arabic compared
to English using similar parsing models. 

A second theme that emerged from the review on Arabic morphological and 
syntactic work is that many projects base their representations on traditional
Arabic grammar. For morphology, there is consensus in the literature that using a
fine-grained approach based on traditional concepts leads to improved annotation 
(Habash, 2007a; Sawalha and Atwell, 2010). For syntax, Smrz and Hajic (2006) 
note that despite traditional Arabic grammar being over a thousand years old, it is 
based on similar concepts to modern representations such as dependencies and 
functional roles. Work on syntactic annotation for the Columbia Arabic Treebank 
(Habash and Roth, 2009c) has shown that annotators prefer to work with 
traditional grammar using familiar concepts and terminology. This has resulted in 
less annotator training and improved inter-annotator agreement and annotation
consistency. 

The implication of these two themes is that although traditional grammar is 
often cited as an inspiration for Arabic computational work, there is ongoing 
debate on how best to represent Arabic syntax using traditional concepts, with 
opinion in favour of dependency representations. An alternative representation 
could be a hybrid representation. Work on dual dependency-constituency parsing 
for German (Hall and Nivre, 2008) and for Swedish (Hall et al., 2007b) has 
demonstrated the feasibility of merged syntactic representations for statistical 
parsing. Similarly, work reviewed for Hebrew showed that integrated models can 
outperform pipeline approaches. For example, Goldberg and Tsarfaty (2008) 
integrate morphological and syntactic disambiguation and report improved 
parsing performance for their task. 

For Classical Arabic, a thesis research question asks if a dependency-based 
representation that incorporates aspects of constituency syntax will be suitable for 
statistical parsing. This thesis will argue that this representation is more closely 
aligned to historical traditional analyses. The next chapter provides relevant 
context for this argument, providing background information on the Arabic 
linguistic tradition.

Verily, your mistakes in grammar are more difficult for
me to bear than your mistakes in archery! 

- Umar ibn al-Khattab, the second Caliph 



3 Historical Background 

3.1 Introduction 

Together with the Indian, Greek and Chinese languages, Arabic has one of the 
world’s major linguistic traditions. The key developments in Arabic linguistics 
occurred during the Islamic Golden Age (750-1250), a time of rapid advances in 
philosophy, science and medicine (Hayes, 1992; Meri and Bacharach, 2006). A 
large number of grammarians contributed to Arabic linguistics. From 750-1500, 
the names of over 4,000 grammarians are known (Versteegh, 1997a). Figure 3.1 
(overleaf) shows a timeline of historical events relevant to the work in this thesis. 

3.2 Motivations of the Early Arabic Grammarians 

Arabic grammarians were motivated to understand and describe the details of 
Classical Arabic because it is the language of the Quran. Adherents of the Islamic 
faith believe that the Quran is the literal word of God, revealed to the Prophet 
Muhammad over a 23-year period, from 609 to 632, the year of his death
(Lings, 1983; Al-Azami, 2003). The Quran is written in Classical Arabic, largely 
in a style of rhymed prose known as saj’ (سجع). Even among non-Islamic scholars
of Arabic, the Quran is widely regarded as a masterpiece of literature due to its 
eloquent and beautiful use of language. For example, Stewart (2000) notes that: 



3 A detailed description of the history of the Arabic linguistic tradition is beyond the scope of 
this chapter. Introductory surveys can be found in Owens (1988), Bohas et al. (1990), Versteegh 
(1997a), Al-Liheibi (1999) and Jiyad (2010).

328: Earliest known Arabic inscription at Namara in the Nabataean alphabet.

609-632: Revelation of the Quran to the Prophet Muhammad. 

632: Death of the Prophet Muhammad. Islam begins to spread rapidly. 

603-688: Abu al-Aswad al-Du’ali: First Arabic grammarian. Analyzed parts of 
speech, conjunctions, attributes, exclamations and interrogatives. 

685-705: Reign of the Caliph Abd al-Malik ibn Marwan. Arabic becomes the 
lingua franca and sole administrative language of the Islamic empire. 

750: Islamic empire controls a vast area of land including Southern France, 
Spain, North Africa, the Middle East, the Indus Valley and Central Asia. 

718-786: Al-Khalil: Introduces vowel marks into Arabic script (harakat) and
the study of prosody (al-‘arud). First Arabic dictionary (kitab al-‘ayn).

731-822: Al-Farra: Establishes that grammar is key to understanding the Quran. 

760-796: Sibawayh: The Book of Grammar (al-kitab fi an-nahw), a seminal
treatise that introduces syntactic governance (‘amal wa ‘amil).

830: Al-Akhfash: Describes rhetorical structures in the Quran. 

826-898: Al-Mubarrad: Collects a corpus of Classical Arabic prose and poetry. 

892-951: Al-Zajjaji: Explores the relationship between grammar and logic. 

932-1002: Ibn Jinni: Detailed work on Arabic phonology and morphology. 

1075-1144: Al-Zamakhshari: Deep linguistic analysis of the Quran. 

1256-1345: Abu Hayyan: Concepts from Arabic linguistics are applied to 
develop functional grammars for Turkic, Ethiopian and Mongolian. 

1308-1359: Ibn Hisham: Fine-grained classification of parts-of-speech. 

1859: Publication of Wright’s grammar in English for the Arabic language. 

1863: Lane’s lexicon: An Arabic-English lexicon based on traditional sources. 

Figure 3.1: Timeline of key developments in Classical Arabic grammar.

It is widely agreed that the Quran is a beautiful text. Umar ibn al-Khattab,
later the second Caliph, vehemently opposed the Prophet’s early preaching 
in Mecca but was so moved upon hearing [the Quran] recited that he 
converted on the spot. What is it that makes the Quran so beautiful and that 
renders any translation a pale shadow of the original? Rhyme and rhythm 
are certainly the most outstanding elements lost in translation. The Quran is 
a profoundly artistic and indeed poetic text. 



Following the rapid spread of Islam, the Quran became the central religious text 
for a large number of non-Arabs, with Arabic as their lingua franca. By 750, the
Umayyad Caliphate had grown to become the largest empire the world had ever 
seen up to that time, controlling a vast area of land that included Southern France, 
Spain, North Africa, the Middle East, the Indus Valley, and Central Asia up to the 
borders of China (Hawting, 2000). However, grammatically correct Arabic was 
often not spoken among the diverse ethnic groups within Islamic civilization. 
Solecisms, termed lahn (لحن), became more frequent as Islam spread (Al-Liheibi,
1999). Concerns over incorrect recitation of the Quran motivated early Arabic 
grammarians to produce detailed work documenting its linguistic rules. 

A later motivation was shu’ubiyya. This movement sought to counter the spread
of Arabic culture through the Islamic principle of racial equality. Following the 
conquest of Persia, from the late 8th century a resurgence in Persian identity 
questioned the dominance of Arabic. Prominent Arabic grammarians responded 
by detailing the language’s unique features (Suleiman, 2003). For example, Al- 
Zamakhshari (1075-1144) felt motivated to produce deep linguistic analyses of 
the Quran in response to criticisms of Arabic on cultural grounds. 

In comparison to modern linguistics, the aims and motivations of traditional 
Arabic grammar differed in two respects. Firstly, concerned by ungrammatical 
language and motivated to preserve the language of the Quran, grammarians were 
primarily interested in describing Arabic’s linguistic rules. Secondly, in common
with believers of Islam today, the grammarians considered the Quran’s language
to be perfect. Driven by their beliefs, they produced detailed analysis of a wide 
variety of linguistic phenomena, developing a comprehensive theory of grammar. 

3.3 Analytical Methods in Traditional Grammar 
3.3.1 Analogical Deduction (qiyas) and Causation (ta’lil)

Despite their different motivations, the analytical methods used by traditional 
grammarians are similar to modern empirical methods. For example, they placed 
importance on using linguistic data in preference to constructed examples. The 
grammarians were interested in describing the purest form of Arabic and focused 
on examples from which evidence could be drawn to support various linguistic 
arguments. Their corpora included the Classical Arabic text of the Quran, 
collections of pre-Islamic poetry, and the speech of the Bedouin, who were 
believed to speak a pure form of Arabic having avoided contact with foreigners. 
An example of this method is the work of Al-Mubarrad (826-898) who collected a 
corpus of Classical Arabic prose and poetry for linguistic analysis in The Book of 
the Perfect (kitab al-kamil).

Based on linguistic data, the two main analytical methods used by traditional 
grammarians were analogical deduction (qiyas - قياس) and causation (ta’lil - تعليل).

Analogy is a process used in Islamic jurisprudence, where rulings for situations 
not described in the Quran are derived through deduction. The same principle was 
used in linguistics. Arabic grammarians described the structure of new sentences 
in their corpora based on previous analyses using analogy, by comparing them to 
similar structures from the Quran and related texts. 

The principle of causation was also a key analytical method. The grammarians 
believed the form of language used by native speakers had underlying causes, 
such as the rules that relate syntactic function to inflectional case endings. For 
example, for certain sentences, the cause of a noun being in the nominative case 
would be due to a grammatical rule that states that all nouns which are subjects of 
verbs are found in the nominative. Similarly, the reason for certain nouns being in 
the accusative case would be the rule that all nouns which are objects of verbs are 
found in the accusative (Owens, 1989). Using the data from corpora together with 
the principles of analogy and causation accelerated the elucidation of Classical 
Arabic’s rules, as various linguistic theories could be efficiently evaluated against 
accepted grammatically correct texts. 

3.3.2 The Basran and Kufan Schools 

Although traditional grammarians made advances in Arabic linguistics, there was 
not always consensus in their approaches. Early on in the development of 
traditional grammar, two competing schools emerged in the Iraqi cities of Basra 
and Kufa. The Kufans are usually credited with initiating grammatical analysis. 
For example, although there are several candidates, Abu al-Aswad al-Du’ali (603- 
688) is often cited as the first Arabic grammarian. He was commissioned by the 
fourth Caliph, Ali ibn Abi Talib, to document the rules of the Arabic language. 
Jiyad (2010) recounts the following story, often cited in later works of traditional 
grammar: 



I came to the leader of the believers, Ali ibn Abi Talib. He said, ‘I have been 
thinking about the language of the Arabs and how it has been corrupted 
through contact with foreigners. I have decided to put something that they 
(the Arabs) refer to and rely on.’ He gave me a note which said: ‘Speech is 
made of nouns, verbs and particles. Nouns are names of things, verbs 
provide information, and particles complete meaning.’ He said to me, 
‘Follow this approach and add to it what comes to mind.’ I wrote chapters 
on conjunctions, attributes, exclamations and interrogatives. Every time I 
finished a chapter I showed it to him until I covered what I thought to be 
enough. He said, ‘How beautiful is the approach you have taken!’ From 
there, the concept of grammar came to exist. 



The Basran and Kufan schools developed Arabic grammar at the same time, 
and were often engaged in competitive discussions. Although the Kufans are 
credited with originating grammar, Basran works have been far more influential on 
later grammarians (Owens, 1988). In contrast to Kufa, a city that attracted many 
Bedouin Arabs, Basra had a more mixed population combining Arabic and 
Persian cultures. The two schools of thought had different approaches to linguistic 
analysis. The Basran school made stronger use of analogy and restricted their 
analysis to the pure speech of Arabs. The Kufans had more prescriptive views. 
They tended to cite anomalous linguistic forms in the analysis of grammatical 
constructions, and were more interested in different readings of the Quran. 

Both schools adopted different terminology for linguistic constructions. Due to 
the larger influence of the Basran school, their terminology became more 
standardized and was used in later works. For example, the Arabic linguistic 
construction of specification is today widely known by the Basran term tamyiz 
instead of the Kufan term mufassir (Al-Liheibi, 1999). Kufan terminology is 
rarely used today, except in comparative work. 



3.3.3 Al-Khalil and Sibawayh 

The grammarian Al-Khalil (718-786) was a founding member of the Basran 
school. His accomplishments include introducing standardized vowel marks into 
Arabic script (harakat) and founding the study of Arabic prosody (al-‘arud). He 
also produced the first Arabic dictionary (kitab al-‘ayn) using citations from the 
Quran and Classical Arabic poetry. His convention of organizing the lexicon by 
root then lemma has been adopted by later Arabic dictionaries, including those for 
Modern Arabic. However, he chose to sort entries using a phonetic listing instead 
of alphabetically, the method more commonly used today. 

Al-Khalil’s student Sibawayh (760-796) is widely regarded as the greatest of all 
Arabic grammarians. He originally arrived in Basra with the intention of studying 
Islamic law. A well-documented incident tells of Sibawayh learning a phrase that 
contained an important religious ruling. When asked to recite the phrase back to 
his tutor, Sibawayh mispronounced the vowelized case-ending of a single word, 
and his tutor publicly corrected him. Aware that this mistake would never have 
been committed by a native Arabic speaker, Sibawayh, a Persian, felt shamed and 
embarrassed. He declared, ‘I will seek such knowledge, that no-one will be able to 
accuse me of making mistakes’ (Carter, 2004). 

Instead of continuing to study law, Sibawayh turned his attention to mastering 
Arabic grammar. His magnum opus was a 1,000-page sophisticated and detailed 
treatise known simply as ‘The Book’ (al-kitab), which to this day remains the 
authoritative work on Classical Arabic grammar. Sibawayh’s kitab is often ranked 
on a par with the work of other great historical linguists, such as Panini’s Ashtadhyayi 
for Classical Sanskrit (Baalbaki, 2008). Sibawayh envisioned an all-encompassing 
grammatical system that would account for the phonology, morphology and 
syntax of Classical Arabic. Carter (2004) notes that: 



Sibawayh is the founder not only of Arabic grammar but also of Arabic 
linguistics, which are by no means the same thing. Furthermore, as becomes 
obvious with every page of his kitab, he was also a genius, whose concept 
of language has a universal validity. When we bear in mind that he was 
probably not even a native speaker of Arabic, being the son of a Persian 
convert, his achievement becomes all the more astonishing. 



A crucial insight of Sibawayh’s analysis is that words in an Arabic sentence 
govern other words to produce distinctive changes in pronunciation. For example, 
if certain particles are placed before a verb, they change the verb’s grammatical 
mood and affect its morphological inflection and surface form. This simple idea 
led to grammatical analysis that focused on analyzing sentence structure by 
describing the syntactic relationships between words in order to explain 
morphological inflection. Concepts from Sibawayh’s seminal work on syntactic 
governance will be used for the syntactic representation in Chapter 6. 

3.4 Further Developments 

Sibawayh’s grammatical analysis had a lasting influence on the Arabic linguistic 
tradition, and his kitab introduced concepts that were extended and refined by 
later grammarians. These included Al-Zajjaji (892-951), who considered the 
relationship between grammar and logic (Zabarah, 2005; Versteegh, 1995), Abu 
Hayyan (1256-1345) who applied concepts from Arabic linguistics to develop 
functional grammars for other languages including Turkic, Ethiopian and 
Mongolian (Versteegh, 1997b), and Ibn Hisham (1308-1359) who introduced a 
fine-grained classification for parts-of-speech, focusing on grammatical particles 
(Gully, 1995). By the time of grammarians such as Ibn Hisham, Arabic linguistic 
analysis reached a stage of sophistication approaching that of modern theories, 
with highly detailed descriptions of Arabic’s phonology, morphology, syntax and 
rhetorical structures. Later work by Orientalists introduced the Arabic linguistic 
tradition to the Western world. Examples include Lane’s Arabic-English Lexicon, 
published in 1859 (Lane, 1992), and Wright’s Grammar of the Arabic Language, 
published in 1863 (Wright, 2007). Both of these works are based on traditional sources, use 
terminology from traditional Arabic grammar and are highly cited in later work. 

Although the early Arabic grammarians provided detailed analysis of examples 
from the Quran, more recent work has focused on comprehensive analysis of the 
entire text. The Quranic Arabic Corpus uses as its primary reference Salih’s work 
al-i’rab al-mufassal likitab allah al-murattal (‘A Detailed Grammatical Analysis 
of the Recited Quran using i’rab’), which collates previous analyses of historical 
Arabic grammar into a single reference work. This analysis of the Quran’s 
morphology and syntax is over 10,000 pages long, spans 12 volumes, and 
provides detailed linguistic analysis for each of the 77,429 words in the Quran 
(Salih, 2007). This detailed work would not have been possible without building 
on centuries of previous analysis by historical Arabic grammarians. 

3.5 Conclusion 

This chapter provided historical background on the Arabic linguistic tradition, 
describing the aims, motivations and analytical methods of early Arabic 
grammarians. The Arabic linguistic tradition is a synthesis of the work of many 
grammarians, but certain key works have defined the field, introducing 
standardized terminology and grammatical concepts. Although this thesis will use 
sources from across this tradition, the syntactic work of Sibawayh stands out as 
one of the main sources of inspiration for developing the hybrid representation for 
Classical Arabic syntax. As will be discussed further in Part II, later works that 
build on this tradition, such as the comprehensive analysis by Salih (2007), will be 
used as primary references for annotation work. 

Part II: Modelling Classical Arabic 




The invention of the alphabet was a singular event in 
human history, a revolutionary as well as unique gift to 
human civilization. 

- Frank Moore Cross 



4 Orthographic Representation 

4.1 Introduction 

Part II of this thesis is divided into three chapters that describe a formal model of 
Classical Arabic. The model consists of representations for Classical Arabic’s 
orthography (this chapter), morphology (Chapter 5), and syntax (Chapter 6). The 
representations are based on concepts from the Arabic linguistic tradition, and are 
used for two purposes. Firstly, they are used to develop the annotation scheme for 
the Quranic Arabic Corpus, described in this part of the thesis. Secondly, the 
representations are used to develop a computational model for Classical Arabic 
statistical parsing, described in Part IV. 

Formal models are representations of systems within a defined mathematical 
framework. They are descriptions that utilize formal concepts such as set theory, 
logic, data structures and transformational rules. In formal linguistics, they are 
used to analyze linguistic structures, such as the grammatical rules that underlie 
sentence construction. In corpus linguistics, formal representations lead to 
annotation schemes for annotating corpora. Although the formalization of 
Classical Arabic in this thesis draws on a large body of work from the Arabic 
linguistic tradition, adapting these works into a well-defined representation is 
challenging. In Arabic grammatical theory, linguistic structures are analyzed 
through prose, in contrast to modern approaches that utilize formal methods. 
Despite this, the tradition uses concepts similar to those of modern linguistics, 
such as morphological segmentation, part-of-speech classification, dependencies 
and semantic analysis. 

The comparison between formal methods and historical analysis in Arabic 
grammar parallels the development of early Islamic mathematics. For example, 
Al-Khwarizmi (780-850) (from whose name the term ‘algorithm’ is derived) put 
forward solutions to the quadratic equation as part of the development of algebra 
(Kleiner, 2007; Katz, 1998). Al-Khwarizmi did not use formal notation for his 
equations, but instead performed mathematics rhetorically, recording his analysis 
in prose. However, his analysis for solving equations remains relevant today. 
Although modern mathematical notation for the quadratic appeared around the 
16th century (e.g. Viète), the widespread use of formal notation for linguistic 
structures is more recent, starting with Chomsky (1957). In comparison, the use of 
formal methods for Classical Arabic can be seen as introducing notation and 
convention to an existing tradition. The aim of the formal model in Part II of this 
thesis is to represent the same analyses found in historical works of traditional 
Arabic grammar. The difference is that unlike the descriptions in prose, formal 
descriptions allow for further computational work such as parsing. 

This chapter focuses on an orthographic representation for Classical Arabic. To 
relate to other Arabic resources, such as electronic lexicons, this representation 
must be convertible to Unicode, the computing standard for multilingual text. 
However, Unicode may not be the best choice as an internal format because the 
same Classical Arabic word can have multiple representations in Unicode as 
different combinations of diacritics and letters, or as pre-composed characters. In 
addition, the Arabic script of the Quran requires special processing to handle 
complex markings such as prosodic recitation marks not found in Modern Arabic. 
To address these issues, this chapter describes JQuranTree, a new open source 
component for the Quran. The component uses a novel character-plus-diacritic 
representation that has an unambiguous mapping to Classical Arabic, simplifying 
its processing. 

The remainder of this chapter is organized as follows. Section 4.2 provides an 
overview of Quranic orthography. Section 4.3 describes the formal orthographic 
representation and section 4.4 describes the computational model, relating this to 
other approaches such as Buckwalter transliteration. Section 4.5 concludes. 

4.2 Quranic Orthography 

4.2.1 The Uthmani Script 

Historically, copies of the Quran have been written in almost exactly the same 
way, with the exception of slight variations in spelling. The two most prominent 
variations are warsh, used in North Africa, and hafs, the 
narration used more widely across the Islamic world (Brockett, 1988). As 
comparative work is beyond the scope of this thesis, a single copy of the Quran 
was chosen for annotation. The Quranic Arabic Corpus is based on the madinah 
mushaf published by the Quran Printing Complex in Madinah. This copy is a hafs 
narration written in the Uthmani script, named after 
its calligrapher Uthman Taha. The madinah mushaf is widely considered to be 
highly accurate in comparison to traditional sources, and since 1985, the Quran 
Complex has printed over 200 million copies of the Quran (Mattson, 2012). 

Figure 4.1 (overleaf) shows the composition of the Uthmani script for part of 
verse (6:76). Arabic is written from right to left using a connected cursive script 
that is more complex than scripts for languages such as English. In early 
historical copies of the Quran, letters were written in their base form, similar to 
(A) in Figure 4.1 (Al-Azami, 2003). This form includes consonants and long 
vowels. However, without pointing, letters are ambiguous, such as the letters fa’ 
and qaf in their frontal positions. Later copies included points to distinguish letters 
(B), and diacritics known as tashkil for the precise pronunciation of short vowels 
(C). The madinah mushaf also includes pause marks to indicate when readers 
should start and stop in longer verses, as part of a prosodic mark-up system (D). 

Due to the nature of the Quran as a central religious text, the script is designed 
to be as unambiguous as possible, encoding detailed information about correct 
pronunciation and recitation. These diacritics will be used in Chapter 7 to guide 
automatic morphological annotation of the Quran. In contrast, this supplementary 
data is not available in Modern Arabic, which is almost always written without 
diacritics, requiring readers to infer vowelization using linguistic knowledge. 

Panels: (A) Base script; (B) Pointed script; (C) Diacritics; (D) Pause marks. 

Figure 4.1: Structure of Quranic script for verse (6:76) in the madinah mushaf. 



4.2.2 The Tanzil Project 

Although digital copies of the Arabic text of the Quran have been available since 
the early 1980s, these were not as accurate as printed copies, often containing 
typographical errors (Khan and Alginahi, 2013). As recently as 2008, searching 
for Quranic verses using Google would result in spelling mistakes in the highest 
ranked search result, as shown in Figure 4.2: 



Figure 4.2: Incorrect Google results for verse (68:38), as of January 21, 2008. 

In contrast to previous work, such as the morphological analysis by Dror et al. 
(2004) described in section 2.2.4, JQuranTree uses orthographic data from the 
Tanzil project (Zarrabi-Zadeh, 2011). Released in 2008, this is the only accurate 
digital copy of the Quran. To ensure accuracy, this project was developed using a 
multi-stage approach. In the first stage, previous digital copies of the Quran were 
compared to produce an initial candidate text. This was followed by automatic 
verification using a set of morphological rules based on traditional grammar. The 
final stage was manual verification. Verse checksums were computed manually 
using all letters and diacritics from the madinah mushaf and then compared to the 
digital version. The orthographic representation in this chapter is based on the 
Uthmani hafs data published by the Tanzil project as a Unicode dataset. 

4.3 Formal Representation 

Unicode is a computing standard for representing text that covers most of the 
world’s writing systems and is used as a data format for exchanging multilingual 
information. Formally, a Unicode string s is a sequence of Unicode characters: 



$$s = (c_1, \ldots, c_n) \mid c_i \in U \quad (1 \le i \le n)$$



Each Unicode character $c_i$, from the set of all characters U, has an associated 
numerical code. Different code ranges are reserved for different languages. For 
Arabic, Unicode characters represent either letters or diacritical marks, with 
diacritics following letters in multiple permitted permutations. For the Quran, 
there have been proposals to extend Unicode to allow for more fine-grained 
representations. For example, Pournader (2010) suggests new characters to 
represent subtle variations in diacritics such as open tanwin and the combined 
versions of small waw used in Quranic script. Despite not implementing these 
extensions, the orthographic Tanzil data represents the Uthmani script with 
sufficient accuracy for the morphosyntactic annotation work in this thesis. 

Character       Glyph    Character                               Glyph
alif            ا        tatwil                                  ـ
ba’             ب        Small high sin                          ◌ۜ
ta’             ت        Small high rounded zero                 ◌۟
tha’            ث        Small high upright rectangular zero     ◌۠
jim             ج        Small high mim (isolated form)          ◌ۢ
ha’             ح        Small low sin                           ◌ۣ
kha’            خ        Small waw                               ۥ
dal             د        Small ya’                               ۦ
dhal            ذ        Small high nun                          ◌ۨ
ra’             ر        Empty center low stop                   ◌۪
zayn            ز        Empty center high stop                  ◌۫
sin             س        Rounded high stop with filled center    ◌۬
shin            ش        Small low mim                           ◌ۭ
sad             ص
dad             ض
ta’             ط
dtha’           ظ
‘ayn            ع
ghayn           غ
fa’             ف
qaf             ق
kaf             ك
lam             ل
mim             م
nun             ن
ha’             ه
waw             و
ya’             ي
hamza           ء
alif maqsura    ى
ta’ marbuta     ة

Table 4.1: Base characters in the orthographic representation. 

Diacritic          Position / description    Glyph
fatha              Above                     ◌َ
damma              Above                     ◌ُ
kasra              Below                     ◌ِ
fathatan           Double fatha              ◌ً
dammatan           Double damma              ◌ٌ
kasratan           Double kasra              ◌ٍ
shadda             Above                     ◌ّ
sukun              Above                     ◌ْ
madda              Above                     ◌ٓ
hamza above        Above ¹                   أ
hamza below        Below ¹                   إ
hamzat wasl        Above alif                ٱ
alif khanjariya    Superscript alif          ◌ٰ

¹ Diacritic hamza shown above/below alif for illustrative purposes. 

Table 4.2: Attached diacritics with their positions relative to base characters. 



For orthographic processing, JQuranTree does not use Unicode for two reasons. 
Firstly, locating a letter by ordinal position requires scanning up to that point in a 
verse, as diacritic sequences can have variable length, resulting in linear, instead 
of constant, time complexity. Secondly, characters such as alif and alif khanjariya 
are in fact the same underlying Arabic letter with only a stylistic difference, and 
should be handled uniformly in tasks such as morphological analysis. Instead, 
JQuranTree uses a character-plus-diacritic representation. In this representation 
variations such as alif and alif khanjariya map to the same base characters with 
distinguishing marking features, simplifying text comparisons with diacritics. 

The character-plus-diacritic representation uses two sets of glyphs. To define 
the representation, let B be the set of base characters, and D be the set of diacritics. 
The set of base characters is derived from the Tanzil data and includes the letters 
and recitation marks used in the Quran (Table 4.1, page 64). The set of diacritics 
is shown in Table 4.2 (page 65). A string s of Arabic text is then formally defined 
as a sequence of compound characters, each of which is a base character (from B ), 
together with a set of zero or more attached diacritics (a subset of D): 



$$s = (c_1, \ldots, c_n)$$

$$c_i = (b_i, d_i) \mid b_i \in B \land d_i \subseteq D \quad (1 \le i \le n)$$



An example of this representation for the third word of verse (70:8) is shown in 
Figure 4.3. This word is pronounced al-sama’u (‘the sky’). The diagram shows 
the word written in Classical Arabic script, followed by its composition into six 
base characters with diacritics attached to five of these. The lower part of the 
diagram shows the character-plus-diacritic representation as a list of pairs $(b_i, d_i)$: 

(alif, {hamzat wasl}) 
(lam) 
(sin, {fatha, shadda}) 
(mim, {fatha}) 
(alif, {madda}) 
(hamza, {damma}) 

Figure 4.3: Character-plus-diacritic representation for Arabic script. 

4.4 Computational Model 

4.4.1 Java Object Model 

JQuranTree uses Object Oriented Programming (OOP) to represent orthography. 
This is the computational design paradigm used for Java programming. Figure 4.4 
shows the classes used to implement the character-plus-diacritic representation. 4 




Figure 4.4: Class hierarchy for orthography in JQuranTree. 

These Java classes are based on the following definitions: 

Document: The Quran is modelled as a single text document. 
Chapter: One of the 114 numbered chapters in the Quran. 

Verse: One of the numbered verses in a chapter. 

Token: A whitespace-delimited span of text within a verse. 
Character: A base character from the set B in Table 4.1 (page 64). 
Diacritic: A diacritic from the set D in Table 4.2 (page 65). 

4 This implementation is freely available online: http://corpus.quran.com/java 

In Arabic computational processing, the term ‘token’ can have multiple 
meanings depending on the processing task, such as a word or its subdivisions. 
JQuranTree uses the term token to denote a whitespace-delimited run of text 
within a Quranic verse. These are often words, although in the Quran multiple 
words with different stems are occasionally fused as a compound word-form. 
Morphological segmentation for compound forms is discussed in Chapter 5. 
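
As a rough illustration of how the definitions in this section fit together, the following Java sketch models a compound character as a base character plus a set of diacritics, and builds token (70:8:3) from Figure 4.3. The type names (BaseChar, Diacritic, CompoundChar, OrthographySketch) and the abbreviated enumerations are assumptions made for this example and are not taken from the JQuranTree source.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; JQuranTree's real classes may differ.
enum BaseChar { ALIF, LAM, SIN, MIM, HAMZA }                 // small subset of set B (Table 4.1)

enum Diacritic { FATHA, DAMMA, SHADDA, MADDA, HAMZAT_WASL }  // small subset of set D (Table 4.2)

// A compound character c_i = (b_i, d_i): one base character plus a set of diacritics.
final class CompoundChar {
    final BaseChar base;
    final Set<Diacritic> diacritics;

    CompoundChar(BaseChar base, Diacritic... marks) {
        this.base = base;
        this.diacritics = EnumSet.noneOf(Diacritic.class);
        for (Diacritic d : marks) {
            this.diacritics.add(d);
        }
    }
}

final class OrthographySketch {
    // Token (70:8:3), al-sama'u, as the list of pairs shown in Figure 4.3.
    static List<CompoundChar> exampleToken() {
        List<CompoundChar> token = new ArrayList<>();
        token.add(new CompoundChar(BaseChar.ALIF, Diacritic.HAMZAT_WASL));
        token.add(new CompoundChar(BaseChar.LAM));
        token.add(new CompoundChar(BaseChar.SIN, Diacritic.FATHA, Diacritic.SHADDA));
        token.add(new CompoundChar(BaseChar.MIM, Diacritic.FATHA));
        token.add(new CompoundChar(BaseChar.ALIF, Diacritic.MADDA));
        token.add(new CompoundChar(BaseChar.HAMZA, Diacritic.DAMMA));
        return token;
    }
}
```

Using a set for the diacritics means that two compound characters compare equal regardless of the order in which their marks were attached, which is the property the formal definition relies on.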

4.4.2 Location Notation 

The Quran is divided into 114 chapters, with each chapter divided into a sequence 
of numbered verses. The pair notation (c:v) is often used in scholarly works to 
reference verses within the Quran. For example (6:76) refers to verse 76, chapter 
6. This thesis extends this notation to tokens using the following definition: 



A location uniquely identifies a token as a triple ( c:v:t ) where c is a chapter 
number, v is a verse number, and t is a token number. 



The Location class in JQuranTree models this concept computationally. In the 
Quranic Arabic Corpus, this notation is used to assign a unique reference number to 
tokens in the Quran, and appears in morphological and syntactic diagrams online. 
Location numbers are also used by annotators during online discussion to refer to 
particular parts of verses and chapters. They will also be used in the syntactic 
representation in Chapter 6, in which each token is annotated with its location 
number in the corpus. 
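
The location triple can be sketched as a small value class. The class below is a hypothetical illustration and is not JQuranTree's own Location class: the name TokenLocation, the parse method and the validation rules are assumptions. Numbering tokens from 1 follows the earlier example, in which token (70:8:3) is the third word of its verse.

```java
// Hypothetical sketch of the (c:v:t) notation; not the JQuranTree Location API.
final class TokenLocation {
    final int chapter;
    final int verse;
    final int token;

    TokenLocation(int chapter, int verse, int token) {
        if (chapter < 1 || chapter > 114) {
            throw new IllegalArgumentException("chapter must be in 1..114");
        }
        if (verse < 1 || token < 1) {
            throw new IllegalArgumentException("verse and token numbers start at 1");
        }
        this.chapter = chapter;
        this.verse = verse;
        this.token = token;
    }

    // Accepts "(c:v:t)" or "c:v:t".
    static TokenLocation parse(String text) {
        String s = text.trim();
        if (s.startsWith("(") && s.endsWith(")")) {
            s = s.substring(1, s.length() - 1);
        }
        String[] parts = s.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected c:v:t but got: " + text);
        }
        return new TokenLocation(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Integer.parseInt(parts[2].trim()));
    }

    @Override
    public String toString() {
        return "(" + chapter + ":" + verse + ":" + token + ")";
    }

    public static void main(String[] args) {
        System.out.println(TokenLocation.parse("70:8:3")); // prints (70:8:3)
    }
}
```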



4.4.3 Internal Representation 

Internally, JQuranTree uses a byte-encoded representation for orthographic data 
that has been optimized for efficient access. This allows the morphological and 
syntactic algorithms described later in this thesis to rapidly process the Quranic 
text. As described in section 4.3, given a block of Unicode Arabic text with 
diacritics, locating a letter by offset requires a linear-time scan, as sequences of 
diacritics are of variable length. The class hierarchy in JQuranTree allows access 
to individual Arabic letters. However, for the entire Quran, representing each 
letter with its own Java object would not be a memory-efficient approach. 
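
The linear-time cost can be seen in a short sketch. It assumes that diacritics are exactly the characters that Unicode classifies as non-spacing marks and that all text lies in the Basic Multilingual Plane, which is a simplification of Quranic script. The class and method names are hypothetical.

```java
// Hypothetical sketch: finding the n-th letter in diacritized Unicode text.
final class UnicodeScan {
    // Returns the index in text of the n-th letter (0-based), or -1 if there is none.
    // Every character before the answer has to be inspected, so the cost grows with n.
    static int indexOfLetter(String text, int n) {
        int lettersSeen = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.getType(text.charAt(i)) == Character.NON_SPACING_MARK) {
                continue; // a diacritic attached to the previous letter
            }
            if (lettersSeen == n) {
                return i;
            }
            lettersSeen++;
        }
        return -1;
    }
}
```

For the sequence sin, fatha, shadda, mim, fatha, the second letter sits at string index 3 rather than index 1, because two diacritics precede it.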

Both of these concerns are addressed by using a byte buffer, with a fixed width 
for each letter including its diacritics. In JQuranTree, character objects are a view 
on the buffer, and are created on demand and garbage collected. Each character is 
represented by three bytes. The first byte encodes the character type. The second 
and third bytes form a vector of bits. Each attached diacritic has a fixed position in 
the bit vector, and if the bit is set then the diacritic is present. The maximum range 
of values possible in this encoding scheme would be 256 types of base character, 
and combinations of 16 diacritic types. In practice, only 44 base character types 
and 13 diacritic types are used in Classical Arabic. 

Character-plus-diacritics    Byte 1    Byte 2       Byte 3
(alif, {hamzat wasl})        0         0            2^3
(lam)                        22        0            0
(sin, {fatha, shadda})       11        2^0 + 2^6    0
(mim, {fatha})               23        2^0          0
(alif, {madda})              0         0            2^0
(hamza, {damma})             28        2^1          0

Figure 4.5: Internal orthographic encoding. 

As an example, the upper part of Figure 4.5 (page 69) shows the 
character-plus-diacritic representation for token (70:8:3). The table in the lower 
part of the diagram shows the internal encoding. In contrast to Unicode, where 
multiple byte-encodings are possible, the token’s six characters and their attached 
diacritics are unambiguously represented using the following 18 bytes: 



(0, 0, 8, 22, 0, 0, 11, 65, 0, 23, 1, 0, 0, 0, 1, 28, 2, 0) 
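
The byte sequence above can be reproduced with a short sketch of the three-byte layout. The base character codes and diacritic bit positions used here are read off Figure 4.5 and appear to follow the row order of Tables 4.1 and 4.2; that reading is an inference, and the class name PackedEncoding and its methods are hypothetical rather than JQuranTree's own.

```java
import java.util.Arrays;

// Hypothetical sketch of a fixed-width, three-bytes-per-character buffer.
final class PackedEncoding {
    static final int BYTES_PER_CHAR = 3;

    // Base character codes as shown in Figure 4.5.
    static final int ALIF = 0, SIN = 11, LAM = 22, MIM = 23, HAMZA = 28;

    // Diacritic bit positions inferred from Figure 4.5:
    // bits 0-7 are stored in byte 2 and bits 8-15 in byte 3.
    static final int FATHA = 0, DAMMA = 1, SHADDA = 6, MADDA = 8, HAMZAT_WASL = 11;

    // Writes one compound character at the given character index.
    static void put(byte[] buffer, int index, int baseType, int... diacriticBits) {
        int bits = 0;
        for (int bit : diacriticBits) {
            bits |= 1 << bit;
        }
        int offset = index * BYTES_PER_CHAR;
        buffer[offset] = (byte) baseType;
        buffer[offset + 1] = (byte) (bits & 0xFF);        // diacritics 0-7
        buffer[offset + 2] = (byte) ((bits >> 8) & 0xFF); // diacritics 8-15
    }

    // Constant-time access: the offset is computed from the index, never searched for.
    static int baseType(byte[] buffer, int index) {
        return buffer[index * BYTES_PER_CHAR] & 0xFF;
    }

    static boolean hasDiacritic(byte[] buffer, int index, int bit) {
        int offset = index * BYTES_PER_CHAR;
        int bits = (buffer[offset + 1] & 0xFF) | ((buffer[offset + 2] & 0xFF) << 8);
        return (bits & (1 << bit)) != 0;
    }

    // Encodes token (70:8:3) following the rows of Figure 4.5.
    static byte[] exampleToken() {
        byte[] buffer = new byte[6 * BYTES_PER_CHAR];
        put(buffer, 0, ALIF, HAMZAT_WASL);
        put(buffer, 1, LAM);
        put(buffer, 2, SIN, FATHA, SHADDA);
        put(buffer, 3, MIM, FATHA);
        put(buffer, 4, ALIF, MADDA);
        put(buffer, 5, HAMZA, DAMMA);
        return buffer;
    }

    public static void main(String[] args) {
        // prints [0, 0, 8, 22, 0, 0, 11, 65, 0, 23, 1, 0, 0, 0, 1, 28, 2, 0]
        System.out.println(Arrays.toString(exampleToken()));
    }
}
```

Because every character occupies exactly three bytes, the character at position i always starts at byte 3i, which is what removes the linear scan needed for Unicode text.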



The Quran contains 6,236 verses. Representing all orthographic data from the 
Tanzil project in Unicode would require 1,389,662 bytes (1.33 megabytes). The 
bit-packed representation used by the orthography model uses 1,242,006 bytes 
(1.18 megabytes). Dividing this by three, we get 414,002 characters for all verse 
text including whitespace, as the internal representation has a constant ratio of 
characters to bytes, regardless of the number of attached diacritics. 
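
The figures in this paragraph can be checked with a short calculation. The megabyte values match only if one megabyte is taken as $2^{20}$ bytes, which is assumed here; the final ratio is derived from the two byte counts and is not stated in the text above.

```latex
\[
\frac{1{,}242{,}006}{3} = 414{,}002 \ \text{characters}, \qquad
\frac{1{,}389{,}662}{2^{20}} \approx 1.33, \qquad
\frac{1{,}242{,}006}{2^{20}} \approx 1.18
\]
\[
\frac{1{,}389{,}662 - 1{,}242{,}006}{1{,}389{,}662}
  = \frac{147{,}656}{1{,}389{,}662} \approx 10.6\%
\]
```

On these numbers the fixed-width encoding is roughly a tenth smaller than the Unicode text, in addition to giving constant-time access.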

4.4.4 Unicode Conversion 

Converting to and from Unicode is supported by JQuranTree to allow the 
Uthmani script to be loaded into the orthographic model, and for exporting Arabic 
text for display on the corpus website. The decoding process is reversible and is 
tested via the round trip method: a Unicode encoder is used to serialize the 
orthography model back into Unicode, and tests are run to ensure that the original 
character data is recovered and no orthographic information is lost. 

Unicode decoding (converting from Unicode into the character-plus-diacritic 
representation) is performed using table lookup. 5 For each Unicode character in 
the Uthmani script, the orthographic base character and diacritics are determined. 
Several Unicode characters may be decoded as a single orthographic base 
character. If table lookup results in a character, then a new base character is 
formed. Otherwise, if the lookup results in only a diacritic, then that diacritic 
marker will be combined with the previous base character. 

5 http://corpus.quran.com/java/unicode.jsp 
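
The decoding rule can be sketched as a single pass over the input with a lookup table. The table below holds only six illustrative rows with their standard Unicode code points; the real conversion table is the one referenced in footnote 5. The class name DecodeSketch, its nested types and the treatment of a pre-composed character as a base character plus a diacritic are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of table-driven Unicode decoding.
final class DecodeSketch {
    enum Base { ALIF, SIN, MIM }
    enum Mark { FATHA, SHADDA, HAMZA_ABOVE }

    static final class Compound {
        final Base base;
        final Set<Mark> marks = EnumSet.noneOf(Mark.class);
        Compound(Base base) { this.base = base; }
    }

    // One table row: a base character, a diacritic, or both (pre-composed forms).
    static final class Entry {
        final Base base;
        final Mark mark;
        Entry(Base base, Mark mark) { this.base = base; this.mark = mark; }
    }

    static final Map<Character, Entry> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0627', new Entry(Base.ALIF, null));             // alif
        TABLE.put('\u0623', new Entry(Base.ALIF, Mark.HAMZA_ABOVE)); // alif with hamza above
        TABLE.put('\u0633', new Entry(Base.SIN, null));              // sin
        TABLE.put('\u0645', new Entry(Base.MIM, null));              // mim
        TABLE.put('\u064E', new Entry(null, Mark.FATHA));            // fatha
        TABLE.put('\u0651', new Entry(null, Mark.SHADDA));           // shadda
    }

    static List<Compound> decode(String unicode) {
        List<Compound> out = new ArrayList<>();
        for (char ch : unicode.toCharArray()) {
            Entry e = TABLE.get(ch);
            if (e == null) {
                throw new IllegalArgumentException("unmapped character U+" + Integer.toHexString(ch));
            }
            if (e.base != null) {
                out.add(new Compound(e.base)); // lookup gave a character: start a new compound
            } else if (out.isEmpty()) {
                throw new IllegalArgumentException("diacritic without a base character");
            }
            if (e.mark != null) {
                out.get(out.size() - 1).marks.add(e.mark); // attach to the latest base character
            }
        }
        return out;
    }
}
```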

Unicode encoding (converting from the character-plus-diacritic representation 
into Unicode) is more complex than decoding. A given subset of the orthographic 
model could have multiple representations in Unicode. This is not only because 
Unicode allows combining marks to be ordered arbitrarily, but also because 
certain combinations of letters and diacritics (such as alif and hamza) have an 
alternative representation as a single pre-composed Unicode character. 
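
Both sources of ambiguity can be observed with the standard java.text.Normalizer class. This sketch is offered only to illustrate the problem and is not part of JQuranTree; the class name UnicodeAmbiguity is hypothetical.

```java
import java.text.Normalizer;

// Illustrates that one orthographic form can have several Unicode spellings.
final class UnicodeAmbiguity {
    static final String COMBINING = "\u0627\u0654"; // alif followed by combining hamza above
    static final String PRECOMPOSED = "\u0623";      // alif with hamza above as one code point

    // The two spellings are different strings...
    static boolean sameCodePoints() {
        return COMBINING.equals(PRECOMPOSED);
    }

    // ...but are canonically equivalent once normalized to NFC.
    static boolean sameAfterNfc() {
        return Normalizer.normalize(COMBINING, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(PRECOMPOSED, Normalizer.Form.NFC));
    }

    // NFC also reorders marks: sin + shadda + fatha becomes sin + fatha + shadda.
    static boolean nfcReordersMarks() {
        return Normalizer.normalize("\u0633\u0651\u064E", Normalizer.Form.NFC)
                .equals("\u0633\u064E\u0651");
    }
}
```

Standard normalization places fatha before shadda, whereas the encoding order in Figure 4.6 places shadda first, so normalizing the text would not by itself reproduce the original Tanzil character sequence.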

The encoding algorithm is shown in Figure 4.6 below. The algorithm’s steps 
ensure that round trip testing is possible. Given Tanzil orthographic data, the 
original sequence of Unicode characters will be recovered after deserializing then 
reserializing. The algorithm uses the same conversion table as decoding so that 
Unicode serialization is perfectly reversible. 



For each compound character in the representation: 

Step 1: If the base character has a diacritic that forms a well-known 
combination, then map this to a single Unicode character. If {hamza above} 
was the diacritic used, then remove this from the list of diacritics to consider. 
The six well-known combinations are: (alif / waw / ya’, {hamza above}), (alif, 
{hamza below}), (alif, {hamzat wasl}), (alif, {khanjariya}). 

Step 2: If Step 1 did not apply, then use the conversion table to determine the 
Unicode character to use for the base character, without its diacritics. 

Step 3: Use the conversion table to form Unicode characters out of any 
remaining diacritics in the following order: {hamza above}, {shadda}, 
{fathatan}, {dammatan}, {kasratan}, {fatha}, {damma}, {kasra}, {sukun}, 
{madda}. 



Figure 4.6: Unicode encoding algorithm. 
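
The three steps of Figure 4.6 can be sketched for a single compound character as follows. Only five base characters are included, and the code points chosen for the six well-known combinations are the standard Unicode characters for those forms, which is an assumption about the conversion table rather than a copy of it. The class name EncodeSketch and its nested types are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the Unicode encoding steps for one compound character.
final class EncodeSketch {
    enum Base { ALIF, WAW, YA, SIN, MIM }
    enum Mark { HAMZA_ABOVE, SHADDA, FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA,
                SUKUN, MADDA, HAMZA_BELOW, HAMZAT_WASL, KHANJARIYA }

    // Step 3 output order as listed in Figure 4.6, with matching code points.
    static final Mark[] ORDER = { Mark.HAMZA_ABOVE, Mark.SHADDA, Mark.FATHATAN, Mark.DAMMATAN,
        Mark.KASRATAN, Mark.FATHA, Mark.DAMMA, Mark.KASRA, Mark.SUKUN, Mark.MADDA };
    static final char[] ORDER_CODES = { '\u0654', '\u0651', '\u064B', '\u064C',
        '\u064D', '\u064E', '\u064F', '\u0650', '\u0652', '\u0653' };

    static char baseCode(Base b) {
        switch (b) {
            case ALIF: return '\u0627';
            case WAW:  return '\u0648';
            case YA:   return '\u064A';
            case SIN:  return '\u0633';
            case MIM:  return '\u0645';
            default:   throw new IllegalArgumentException("unmapped base character: " + b);
        }
    }

    // Step 1 lookup: the single character for a well-known combination, or '\0' if none applies.
    static char wellKnown(Base b, Set<Mark> marks) {
        if (marks.contains(Mark.HAMZA_ABOVE)) {
            if (b == Base.ALIF) return '\u0623';
            if (b == Base.WAW)  return '\u0624';
            if (b == Base.YA)   return '\u0626';
        }
        if (b == Base.ALIF) {
            if (marks.contains(Mark.HAMZA_BELOW)) return '\u0625';
            if (marks.contains(Mark.HAMZAT_WASL)) return '\u0671';
            if (marks.contains(Mark.KHANJARIYA))  return '\u0670';
        }
        return '\0';
    }

    static String encode(Base base, Set<Mark> marks) {
        EnumSet<Mark> remaining = EnumSet.noneOf(Mark.class);
        remaining.addAll(marks);
        StringBuilder out = new StringBuilder();
        char combined = wellKnown(base, remaining);
        if (combined != '\0') {
            out.append(combined);               // Step 1
            // wellKnown checks hamza above first, so if it is present here it was the one used.
            remaining.remove(Mark.HAMZA_ABOVE);
        } else {
            out.append(baseCode(base));         // Step 2
        }
        for (int i = 0; i < ORDER.length; i++) { // Step 3
            if (remaining.contains(ORDER[i])) {
                out.append(ORDER_CODES[i]);
            }
        }
        return out.toString();
    }
}
```

Fixing the output order in Step 3 is what makes the result deterministic: a set of diacritics has no inherent order, so without it the same compound character could be serialized in several ways.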






4.4.5 Extended Buckwalter Transliteration 

In addition to Unicode conversion, JQuranTree supports converting to and from 
Buckwalter transliteration. This is an ASCII-based encoding scheme that is fully 
reversible, so that no information is lost during transliteration. A reversible 
transliteration scheme can be used for precisely specifying the orthography of 
Arabic words in computational work. The BAMA system described in section 
2.2.1 stores its morphological lexicon in this format, and this data will be used in 
Chapter 7 for Classical Arabic annotation. 
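
Reversibility follows from the scheme being a one-to-one character table, which the sketch below illustrates with six standard Buckwalter symbols. The sample word is the proper noun Yahya, and its diacritized Uthmani spelling is assumed here. The class name BuckwalterSketch is hypothetical and the table is a small subset; this is not JQuranTree's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a reversible, table-driven transliteration.
final class BuckwalterSketch {
    static final Map<Character, Character> TO_ASCII = new HashMap<>();
    static final Map<Character, Character> TO_ARABIC = new HashMap<>();

    // Registers one pair in both directions, keeping the mapping one-to-one.
    static void pair(char arabic, char ascii) {
        TO_ASCII.put(arabic, ascii);
        TO_ARABIC.put(ascii, arabic);
    }

    static {
        pair('\u064A', 'y'); // ya'
        pair('\u062D', 'H'); // ha' (pharyngeal)
        pair('\u0649', 'Y'); // alif maqsura
        pair('\u064E', 'a'); // fatha
        pair('\u0652', 'o'); // sukun
        pair('\u0670', '`'); // superscript alif
    }

    private static String map(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            Character mapped = table.get(ch);
            if (mapped == null) {
                throw new IllegalArgumentException("no mapping for U+" + Integer.toHexString(ch));
            }
            out.append(mapped.charValue());
        }
        return out.toString();
    }

    static String toBuckwalter(String arabic) {
        return map(arabic, TO_ASCII);
    }

    static String toArabic(String ascii) {
        return map(ascii, TO_ARABIC);
    }
}
```

Rejecting unmapped characters, rather than passing them through unchanged, keeps the round trip exact: any string that transliterates successfully can be converted back without loss.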

JQuranTree extends Buckwalter’s scheme to include additional symbols in the 
Uthmani script. Four non-Arabic characters in the original scheme (not found in 
the Quran) are used for dialects and foreign words: P (peh), J (tcheh), V (veh) and 
G (gaf). The combination character (alif, {madda}), encoded as a vertical bar ‘|’, 
is also not used in the Tanzil orthographic data. These characters are not 
implemented by JQuranTree. Similarly, 14 Quranic symbols do not feature in the 
original scheme. In the extended scheme these are assigned to ASCII punctuation 
marks, which is unambiguous as modern punctuation does not occur in the Quran. 

Table 4.3 (overleaf) shows the additional characters. As an example, token 
(19:7:6) in the Quran is the proper noun Yahya (يحيى), which would be encoded as 
yaHoyaY`. The Token class in JQuranTree implements this conversion process. 
Figure 4.7 shows an example Java program for accessing this implementation: 



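public class BuckwalterExample { 
    public static void main(String[] args) { 

        // Look up token 6 of verse 7 in chapter 19 (the proper noun Yahya). 
        Token token = Document.getToken(19, 7, 6); 

        // Print the token in extended Buckwalter transliteration. 
        System.out.println(token.toBuckwalter()); 
    } 
} 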



Figure 4.7: Example JQuranTree program for Buckwalter transliteration. 
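
Reversibility follows from the transliteration table being a one-to-one mapping. The sketch below illustrates this with a deliberately tiny table covering only the characters of the Yahya example; the class BuckwalterSketch and its methods are hypothetical and are not part of JQuranTree.

```java
import java.util.HashMap;
import java.util.Map;

public class BuckwalterSketch {

    private static final Map<Character, Character> TO_ARABIC = new HashMap<>();
    private static final Map<Character, Character> TO_ASCII = new HashMap<>();

    // Registers one ASCII/Unicode pair in both directions.
    private static void pair(char ascii, char arabic) {
        TO_ARABIC.put(ascii, arabic);
        TO_ASCII.put(arabic, ascii);
    }

    static {
        pair('y', '\u064A'); // ya
        pair('H', '\u062D'); // ha
        pair('Y', '\u0649'); // alif maqsura
        pair('a', '\u064E'); // fatha
        pair('o', '\u0652'); // sukun
        pair('`', '\u0670'); // dagger alif
    }

    // Maps every character through the given table, rejecting unknown input.
    private static String convert(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character mapped = table.get(c);
            if (mapped == null) {
                throw new IllegalArgumentException("No mapping for character: " + c);
            }
            out.append(mapped);
        }
        return out.toString();
    }

    static String decode(String buckwalter) {
        return convert(buckwalter, TO_ARABIC);
    }

    static String encode(String arabic) {
        return convert(arabic, TO_ASCII);
    }

    public static void main(String[] args) {
        String ascii = "yaHoyaY`";
        String arabic = decode(ascii);
        System.out.println(arabic);
        // Encoding the decoded text gives back the original ASCII string.
        System.out.println(encode(arabic).equals(ascii));
    }
}
```

Rejecting unmapped characters, rather than passing them through, keeps the round trip lossless: any input the table cannot represent is reported instead of being silently altered.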

Symbol                                  Encoding 
Madda                                   ^ 
Hamza above                             # 
Small high sin                          : 
Small high rounded zero                 @ 
Small high upright rectangular zero     " 
Small high mim (isolated form)          [ 
Small low sin                           ; 
Small waw                               , 
Small ya’                               . 
Small high nun                          ! 
Empty centre low stop                   - 
Empty centre high stop                  + 
Rounded high stop with filled centre    % 
Small low mim                           ] 

Table 4.3: Additional characters in extended Buckwalter transliteration. 



4.4.6 Orthographic Search 

JQuranTree implements the class TokenSearch for orthographic search. This finds 
all tokens that match an orthographic form specified using extended Buckwalter 
transliteration and is useful for tasks such as implementing a concordance. Figure 
4.8 (overleaf) shows an example Java program that uses this class to find 
occurrences of the orthographic form qamar (the word ‘moon’) in the Quran. 
When run, this program will display all exactly matching surface forms (قمر). 

public class TokenSearchExample { 
    public static void main(String[] args) { 

        // Search using extended Buckwalter transliteration. 
        TokenSearch search = new TokenSearch(EncodingType.Buckwalter); 

        // Find all tokens containing the orthographic form qamar. 
        search.findSubstring("qamar"); 

        // Display the matching tokens. 
        System.out.println(search.getResults()); 
    } 
} 



Figure 4.8: Orthographic search using Buckwalter transliteration. 



Because orthographic search matches a specific spelling, including its diacritic 
markers, it returns only exact matches and takes no account of morphological 
inflection. Online, the corpus website extends this 
search to provide users with a more flexible search based on matching lemmas, 
parts-of-speech tags and morphological features (described in section 8.4.2). 
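
The behaviour of exact orthographic matching can be illustrated with a small sketch. The class OrthographicSearchSketch and the token list are hypothetical: the tokens are toy transliterations chosen for the example, not corpus data, and the method is not JQuranTree's TokenSearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrthographicSearchSketch {

    // Returns the indexes of tokens whose transliteration contains the
    // query as an exact substring, diacritics included.
    static List<Integer> findSubstring(List<String> tokens, String query) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).contains(query)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy token list, not drawn from the corpus.
        List<String> tokens = Arrays.asList("qamaru", "{loqamara", "$amosu", "qmr");
        System.out.println(findSubstring(tokens, "qamar"));
    }
}
```

The unvocalized spelling qmr is not returned, because the query is compared character by character with its diacritics. Matching across spellings or inflected forms needs the lemma-based search mentioned above.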

4.5 Conclusion 

The Uthmani script of the Quran has complex orthography and includes additional 
characters and markings not used in Modern Arabic. These include verse pause 
marks for specifying detailed pronunciation, and diacritical marks used to indicate 
inflection as part of Arabic’s morphological and syntactic rules. 

This chapter described a formal orthographic representation for the Quran, as 
well as JQuranTree, the representation’s realization as a computational system. To 
represent the Quranic text, orthographic data from the Tanzil project was used 
(Zarrabi-Zadeh, 2011). This work was required to unambiguously represent the 
Classical Arabic script of the Quran in a computational system, so that no 
orthographic information is lost during processing. JQuranTree is made freely 
available online as an open source project for accessing and searching the original 
Arabic text of the Quran. The orthographic model presented here is used next for 
the morphological representation described in the following chapter. 

The Semitic root is one of the great miracles 
of man’s language. 

- Johannes Lohmann 



5 Morphological Representation 

5.1 Introduction 

This chapter describes the formal representation used to develop morphological 
annotation in the Quranic Arabic Corpus. The representation provides a model for 
Classical Arabic word structure that is designed to be fine-grained and suitable for 
statistical parsing. Computationally, the formalism is based on the lexeme-plus- 
feature representation reviewed in section 2.2.2 (Habash, 2007a) for two reasons. 
Firstly, analyzing word structure using a lemma and a set of features is an 
intuitive approach to Arabic morphology that is easily understandable by 
annotators. Secondly, the feature-value data structures in Habash’s representation 
are directly applicable to machine learning and parsing work. 

However, the representation described in this chapter differs in several respects. 
Following the direction taken by Sawalha and Atwell (2013), a more fine-grained 
approach is used for Arabic morphology. As described in the literature review, 
annotating a set of detailed morphological features during treebank construction 
improves parsing accuracy. Another difference is that Habash’s scheme is 
designed for Modern Arabic. For Classical Arabic, different features and part-of-speech 
tags are used that more closely align the representation to traditional 
sources. Finally, an alternative segmentation scheme is used that is better suited to 
the Quranic text. Inspired by recent computational work for Arabic morphology 
by Smrz (2007) and Habash (2007a; 2010), both form and function are modelled. 
Form is modelled by segmenting Arabic words into their constituent morphemes. 
Function is modelled by associating a set of morphological features with each 
segment, such as person, gender, number and syntactic inflection features. 

The remainder of this chapter is organized as follows. Section 5.2 provides an 
overview of Classical Arabic morphology and defines key terminology. Section 
5.3 provides a formal description of the representation. Sections 5.4, 5.8 and 5.9 
describe the part-of-speech tagset, the feature set and the segmentation scheme 
respectively. Section 5.10 compares formal representations of Classical Arabic 
morphological structures to traditional analyses and section 5.11 concludes. 

5.2 Classical Arabic Morphology 

5.2.1 Traditional Morphological Analysis 

Classical Arabic is a morphologically-rich language with complex word structure. 
In traditional Arabic grammar, morphological analysis is a well-established field 
of study known as sarf (صرف), which has been continuously developed from the 
start of the Arabic linguistic tradition by grammarians. Prominent examples 
include Sibawayh (760-796), who devoted half of al-kitab to the subject. He 
described Arabic’s inflectional and derivational processes, as well as its root and 
pattern system (Carter, 2004). Al-Farra (731-822) and Al-Akhfash (d. 830) each 
wrote linguistic works focused entirely on morphological analysis. Ibn Jinni (932- 
1002) further developed the field, and was the first Arabic grammarian to 
explicitly define the difference between morphology and syntax, famously stating: 



Morphology deals with the form of words, while syntax studies words in 
their different contexts. 



By the time of the grammarian Ibn Mas’ud (ca. 1250-1300), morphological 
analysis for Classical Arabic was highly developed, and on par with modern 
linguistic work. His treatise marah al-arwah contained detailed descriptions of 
verb and noun patterns, providing phonological and semantic context for Arabic’s 
rich morphology, building on a large body of previous work (Akesson, 2011). 

Concepts from Classical Arabic morphology are also applicable to Modern 
Arabic, as both forms of the language share a common morphological system. 
However, there are distinctions between the two. For example, in spoken Modern 
Arabic inflection is simplified and case endings are generally omitted, whereas 
Classical Arabic is fully vocalized. Similarly, Classical Arabic has a richer set of 
particles that are used as concatenative prefixes, such as the hamza of equalization 
(همزة التسوية), requiring a different set of segmentation rules. 

5.2.2 Roots and Patterns 

A distinguishing feature of Arabic, and other Semitic languages such as Hebrew, 
is nonconcatenative morphology (Habash, 2007; Boudelaa and Marslen-Wilson, 
2001). Most Arabic words can be structured as the combination of two abstract 
morphemes: a lexical root and a template pattern. This approach is termed 
nonconcatenative because the root’s letters are not always found consecutively in 
derived words. The use of roots and patterns was an early development in the 
Arabic linguistic tradition (Muhammad, 2007; Versteegh, 1997b). For Modern 
Arabic, this has remained the standard approach in morphological analysis (Mace, 
2007; Wightwick and Gaafar, 2008). For example, both Classical and Modern 
Arabic dictionaries are organized by root. For the purposes of computational work 
in this thesis, the following definitions will be used for Classical Arabic: 

A root (jithr - جذر) is a sequence of three or four consonants (known as 
radicals) that is used to derive a group of related words. These sequences are 
known as triliteral and quadriliteral roots respectively. 

A pattern (wazn - وزن) is a template consisting of consonants and vowels 
together with placeholders for a root’s radicals. 

Derivation (ishtiqaq - اشتقاق) is the morphological process in which a root 
in combination with a pattern generates a derived word. 

The nonconcatenative system for word generation in Arabic is well developed. 
Several hundred patterns in combination with thousands of roots allow for a large 
number of possible derived words, although in practice the number of roots is 
limited. For Classical Arabic, Lane’s Lexicon lists 3,775 roots based on traditional 
sources (Lane, 1992). A more comprehensive Classical Arabic dictionary is lisan 
al-‘arab (لسان العرب) by Ibn Manzur (1233-1312). Hegazi and El-Sharkawi (1985) 
estimate that the lexicon contains 6,350 triliteral roots and 2,500 quadriliteral 
roots, although only 1,200 of these are still used in Modern Arabic. For Modern 
Arabic as a whole, Ryding (2005) estimates that between 5,000 and 6,500 roots 
are currently in use. 

In both varieties of Arabic, roots are used to form words with related meanings. 
For this reason, a root is said to generate a semantic field (Badawi and Haleem, 
2008). The canonical example used to illustrate this is the root ka ta ba (ك ت ب), 
used in both Classical and Modern Arabic. This root generates the verb ‘write’ 
(kataba - كتب) and the nouns ‘writing’ (kitabah - كتابة), ‘writer’ (katib - كاتب), 
‘book’ (kitab - كتاب) and ‘desk / office’ (maktab - مكتب). In traditional analysis, 
the patterns used to derive these words are specified using the placeholder letters 
fa’ ‘ayn lam (ف ع ل). For example, the pattern for katib (كاتب) is fa’il (فاعل), a form 
I active participle. In the Quranic Arabic Corpus, root tagging is the basis for 
further annotation including derived and inflectional morphological forms. 
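
Derivation from a root and a pattern can be sketched as template filling. In the hypothetical class below, the digits 1 to 4 play the role of the traditional placeholder letters, and words are written in the simplified Latin transliteration used in this section; neither convention is taken from the corpus software.

```java
public class RootPatternSketch {

    // Digits 1-4 in the pattern stand for the first to fourth radical of
    // the root; every other character is copied from the pattern as-is.
    static String derive(String root, String pattern) {
        if (root.length() < 3 || root.length() > 4) {
            throw new IllegalArgumentException("A root has three or four radicals: " + root);
        }
        StringBuilder word = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c >= '1' && c <= '4') {
                int index = c - '1';
                if (index >= root.length()) {
                    throw new IllegalArgumentException("Root has no radical " + c);
                }
                word.append(root.charAt(index));
            } else {
                word.append(c);
            }
        }
        return word.toString();
    }

    public static void main(String[] args) {
        String root = "ktb";
        String[] patterns = {"1a2a3a", "1i2a3ah", "1a2i3", "1i2a3", "ma12a3"};
        for (String pattern : patterns) {
            System.out.println(pattern + " -> " + derive(root, pattern));
        }
    }
}
```

Applied to the root ktb, the five patterns produce kataba, kitabah, katib, kitab and maktab, the words listed above. The radicals are interleaved with the pattern's vowels and affixes, which is why the morphology is described as nonconcatenative.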

5.2.3 Inflection and Concatenation 

In Arabic, derived words can undergo two changes before appearing in their final 
surface form, due to semantic and syntactic context: 

Inflection (tasrif - تصريف) is the morphological process in which the form 
of a word is modified by grammatical attributes or syntactic function. 

Concatenation is the morphological process in which the form of a word is 
modified by attaching prefixes and suffixes. A stem is the part of the word to 
which prefixes and suffixes are attached. 

In the process of inflection, words are modified by grammatical attributes. For 
example, the masculine form for teacher, mu’alim, becomes mu’alimah 
in the feminine. Relevant to parsing work, words are also inflected for 
syntactic function through case endings. In morphological concatenation, words 
are further modified by attaching prefixes and suffixes. Unlike in English, where 
the syntactic unit is primarily the word, in Arabic, stems, prefixes and suffixes are 
units for syntactic analysis, requiring decomposition as a prerequisite for parsing: 



Segmentation is the reverse process of concatenation. 

Morphological segments are the concatenative morphemes that result from 
segmentation. These are stems, prefixes and suffixes. 



To illustrate these concepts, Figure 5.1 below shows token (14:22:30) from the 
Quran. This compound word بمصرخكم (translated as ‘with your helper’) exhibits 
rich morphology. Its surface form (bimus’rikhikum) is a concatenation of a 
prefixed preposition (bi), a stem (a form IV active participle - mus’rikh) and a 
suffixed pronoun (kum). The stem’s surface form is related to its syntactic 
function. Due to the prefixed preposition, the stem is inflected for the genitive 
case (mus’rikhi). Figure 5.2 (overleaf) shows how the word is composed through a 
combination of derivation, inflection and concatenation. 



(14:22:30) 
bimus ’rikhikum 
‘with your helper’ 




Figure 5.1: A compound Classical Arabic word-form in verse (14:22). 

Figure 5.2: Derivational and inflectional morphology with form and function. 

5.2.4 Lemmas 

In Arabic, a root gives rise to a group of derived words with related meanings. 
Each of these derived words gives rise to a secondary group of words that differ 
only by inflection. In Arabic lexicographic analysis, this inflection group is 
known as a lexeme: 



A lexeme is a group of words with the same derivational morphology that 
differ only by inflection. 

A lemma (also known as a citation form) is a conventional choice of one 
word that represents a lexeme. 



Both Modern and Classical Arabic dictionary entries are organized by root then 
lemma, but stop short of enumerating inflected or concatenated forms due to the 
large number of inflection patterns. 

5.3 Formal Representation 

5.3.1 Segmentation 

This section formalizes Classical Arabic morphological structures by extending 
Habash’s lexeme-plus-feature representation for Modern Arabic (Habash, 2007a). 
This is based on the concept of using a lemma and a set of feature-value pairs. In 
contrast to Habash’s work, the representation here supports multiple stems. This is 
due to the frequent occurrence of contractions in Classical Arabic script, such as 
the fused word-form ‘about-what’ (‘amma - عم) consisting of the particles ‘about’ 
(‘an - عن) and ‘what’ (ma - ما), each with a distinct stem and syntactic function. 
For this reason, the lemma and features are attached to individual morphological 
segments, instead of the word-level attachment in Habash’s scheme. As a 
consequence, each segment in a Classical Arabic word has its own part-of-speech. 






The first part of the formalization describes segmentation. A token was defined 
in Chapter 4 as a whitespace-delimited span of text. This is either a single stem or 
a compound word-form constructed by concatenating multiple segments. Using 
the orthographic representation from section 4.3, a token w is a sequence of base 
characters with attached diacritics: 



\[ w = (c_1, \ldots, c_n) \]

\[ c_i = (b_i, d_i) \mid b_i \in B \land d_i \subseteq D \quad (1 \le i \le n) \]



Morphologically, a token is partitioned into a sequence of m segments. Let each 
segment $s_i$ $(1 \le i \le m)$ span base characters in the token from positions $S(i)$ to $E(i)$. 
The following constraints are used to ensure that the partition covers the entire 
token continuously: 



\[ w = (s_1, \ldots, s_m) \]

\[ S(1) = 1 \land E(m) = n \]
\[ S(i+1) = E(i) + 1 \quad (1 \le i < m) \]
\[ E(i) \ge S(i) \quad (1 \le i \le m) \]



This definition of segmentation applies to all segments except those of zero- 
length. These are abbreviated suffixed pronouns represented by a diacritic, such as 
(3:35:5) rabbi (رب) - ‘my lord’. This special case is described in section 5.9. 
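
The partition constraints can be checked mechanically. The sketch below is a hypothetical helper, not part of the corpus software; it covers only segments of non-zero length and takes the start and end positions S(i) and E(i) as two parallel arrays.

```java
public class SegmentationCheck {

    // starts[k] and ends[k] hold S(k+1) and E(k+1): 1-based, inclusive
    // positions of base characters in a token of n base characters.
    static boolean isValidPartition(int n, int[] starts, int[] ends) {
        int m = starts.length;
        if (m == 0 || ends.length != m) {
            return false;
        }
        // S(1) = 1 and E(m) = n: the partition covers the whole token.
        if (starts[0] != 1 || ends[m - 1] != n) {
            return false;
        }
        for (int k = 0; k < m; k++) {
            // E(i) >= S(i): every segment spans at least one base character.
            if (ends[k] < starts[k]) {
                return false;
            }
            // S(i+1) = E(i) + 1: no gaps and no overlaps between neighbours.
            if (k + 1 < m && starts[k + 1] != ends[k] + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed segmentation of a seven-letter token into prefix, stem and suffix.
        System.out.println(isValidPartition(7, new int[] {1, 2, 6}, new int[] {1, 5, 7}));
        // A gap between the first and second segment violates the constraints.
        System.out.println(isValidPartition(7, new int[] {1, 3, 6}, new int[] {1, 5, 7}));
    }
}
```

The first call models a token with seven base characters split as 1 + 4 + 2, the shape assumed here for bimus’rikhikum (prefix bi, stem, suffix kum); the second fails because position 2 belongs to no segment.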

5.3.2 Feature-Value Pairs 

The representation associates a set of feature-value pairs with each morphological 
segment in a token $w = (s_1, \ldots, s_m)$. Formally, a feature is a function that maps a 
segment to a feature-value: 

\[ f_j(s_i) \in F_j \quad (1 \le i \le m,\ 1 \le j \le M) \]



Here M is the number of features in the representation and $F_j$ is the set of 
possible values for feature j. In the annotation scheme, the term ‘feature’ is used 
in a functional sense. These include segment type (stem, prefix or suffix), root, 
lemma and grammatical features such as person, gender and number. 

5.3.3 Feature Notation 

The Quranic Arabic Corpus uses a formal notation for morphological annotation, 
written as a sequence of tags in square brackets. Morphologically annotated data 
is stored in the corpus database using this format. Each tag either starts a new 
segment, or describes a feature-value pair associated with the previous segment. 
For example, the compound word-form bimus’rikhikum (بمصرخكم) in Figure 5.1 
(page 79) is tagged as: 



[bi+ POS:N ACT PCPL (IV) LEM:muSorix ROOT:Srx M GEN PRON:2MP] 



In this example, the symbol bi+ is the prefixed preposition bi. POS:N is a noun 
(a stem) followed by derivation features (active participle, form IV). The next two 
features are the stem’s lemma and root specified using Buckwalter transliteration, 
followed by inflection features for masculine and the genitive case. The symbol 
PRON:2MP is a suffixed second person masculine plural pronoun. These tags 
correspond to the morphological analysis in Figure 5.2 (page 80). This notation is 
designed to be machine-readable but is also purposefully verbose so that 
annotators do not have to frequently consult annotation guidelines to look up the 
meaning of tags. The remainder of this chapter describes the part-of-speech tags 
and morphological features for Classical Arabic in more detail. 
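
Since each tag either starts a segment or attaches to the previous one, the notation can be read in a single pass. The sketch below is illustrative only. It assumes that a tag ending in ‘+’ starts a prefix, that a POS: tag starts a stem and that a PRON: tag starts a suffix; this is enough for the example above but is not a full statement of the corpus rules. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TagNotationSketch {

    static final class Segment {
        final String type;  // "prefix", "stem" or "suffix"
        final String head;  // the tag that started the segment
        final List<String> features = new ArrayList<>();

        Segment(String type, String head) {
            this.type = type;
            this.head = head;
        }

        @Override
        public String toString() {
            return type + " " + head + " " + features;
        }
    }

    static List<Segment> parse(String notation) {
        String body = notation.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<Segment> segments = new ArrayList<>();
        for (String tag : body.trim().split("\\s+")) {
            if (tag.isEmpty()) {
                continue;
            }
            if (tag.endsWith("+")) {
                segments.add(new Segment("prefix", tag));
            } else if (tag.startsWith("POS:")) {
                segments.add(new Segment("stem", tag));
            } else if (tag.startsWith("PRON:")) {
                segments.add(new Segment("suffix", tag));
            } else if (segments.isEmpty()) {
                throw new IllegalArgumentException("Feature before any segment: " + tag);
            } else {
                // Any other tag is a feature of the most recent segment.
                segments.get(segments.size() - 1).features.add(tag);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String tags = "[bi+ POS:N ACT PCPL (IV) LEM:muSorix ROOT:Srx M GEN PRON:2MP]";
        for (Segment segment : parse(tags)) {
            System.out.println(segment);
        }
    }
}
```

For the example word this yields three segments, with the derivation, lemma, root and inflection tags all attached to the stem, mirroring the per-segment feature functions of section 5.3.2.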

5.4 Parts of Speech 

5.4.1 The Part-of-Speech Hierarchy in Arabic Grammar 

In traditional Arabic grammar, parts of speech are organized into a hierarchy 
consisting of three main classes that are divided into subclasses (Owens, 1989). 
The main classes are nominals (ism - اسم), verbs (fi’il - فعل) and particles (harf - 
حرف). This classification was introduced at the beginning of the Arabic linguistic 
tradition. For example, Sibawayh’s kitab opens by establishing that the topic of 
his book is speech ( kalam ) and that speech is divided into three main categories. 
He divides the class of nouns into subclasses including explicit nouns and 
pronouns, and organizes the class of particles by their syntactic function (Carter, 
2004; Baalbaki, 2008). Later grammarians refined these subdivisions, such as Ibn 
Hisham who developed a detailed classification of particles according to syntactic 
and semantic usage (Gully, 1995). 

However, a frequent simplification for certain computational tasks is that 
Arabic has only three parts of speech. In contrast to the work in this thesis, several 
Arabic computational systems have previously relied on only the three main 
classes. Examples of underrepresentation include parsing work by Mehdi (1985) 
and Shokrollahi-Far et al. (2009), verbal representations by Islam et al. (2010) and 
stemming work for information retrieval by Moukdad (2006). As noted by Attia 
(2008), the simplification that Arabic has only three parts of speech arises by only 
considering the main classes and not their subdivisions: 



It is quite surprising to see many morphological analyzers today influenced 
by the misconception that Arabic parts of speech are exclusively nouns, 
verbs and particles. The Xerox Arabic morphological analyzer is a good 
example of this limitation (Beesley, 2001). In Xerox morphology, words are 
classified strictly into verbs, nouns and particles; no other categorical 
description is used. 

In deeper computational analysis, such as the work presented in this thesis, part-of-speech 
tagsets are more fine-grained. Other examples of rich tagsets for Arabic 
include the Penn Arabic Treebank tagset by Buckwalter (2002), the Prague Arabic 
Dependency Treebank tagset by Hajic et al. (2004), and the theory-neutral tagset 
by Sawalha and Atwell (2010). Modern Arabic computational work often cites 
traditional grammar as a source of inspiration. For example, the tagger developed 
by Khoja (2001) uses a tagset based on traditional sources: 



Since the grammar of Arabic has been standardized for centuries, [the 
tagset] is derived from this grammatical tradition rather than from an Indo- 
European based tagset. Arabic grammarians traditionally analyze all Arabic 
words into three main parts-of-speech. These parts-of-speech are further 
subcategorized into more detailed parts-of-speech which collectively cover 
the whole of the Arabic language. 



5.4.2 Part-of-Speech Analysis in al-i’rab al-mufassal 

For Classical Arabic part-of-speech tagging, the Quranic Arabic Corpus uses as its 
primary reference al-i’rab al-mufassal likitab allah al-murattal (‘A Detailed 
Grammatical Analysis of the Recited Quran using i’rab’) (Salih, 2007). Because 
this work builds on multiple sources, it provides morphological and syntactic 
analysis for the entire Quran. Salih provides more detail in comparison to related 
works such as Darwish (1996), who instead provides more concise grammatical 
analysis alongside exegetic commentary. 

Developing a part-of-speech tagset using Salih as a reference is complicated by 
several factors. Firstly, he does not list or define his grammatical terminology, 
assuming the reader has expertise with traditional grammar and is familiar with its 
conventions. At over 10,000 pages of prose, the reference work is also lengthy, 
using alternative terminology in different places. Finally, the text is not available 
in an easily machine-readable form, making automatic extraction of its analyses 
unviable. Consequently, deriving a complete listing of grammatical terminology 
in Salih’s work is only possible by reviewing the complete text. 

The part-of-speech tagset presented here is based on a manual review of Salih’s 
analysis. During this review, the key terms for parts-of-speech, morphological 
features and syntactic constructions were documented and compared to Darwish’s 
terminology. The two works were found to use essentially the same standardized 
terms. However, although both works primarily use Basran terminology, Salih 
also uses Kufan. For example, he often uses the Kufan term na’t (نعت) alongside 
the Basran sifa (صفة) for adjectives (Carter, 2000). An example of Salih’s analysis 
for verse (77:21) is shown in Figure 5.3 below. This provides morphological 
analysis with segmentation and part-of-speech tagging, together with a description 
of syntactic structure: 6 









Figure 5.3: Salih’s grammatical analysis for verse (77:21). 



6 Salih (2007). Volume 12, page 297. 

In his morphological analysis, Salih divides the first word in the verse (فجعلناه) 
into four segments: a prefixed conjunctive particle, a verb, a suffixed subject 
pronoun and a suffixed object pronoun. The second and third words in the verse 
are described as a prepositional phrase (جار ومجرور). This concise analysis 
assumes that the reader is sufficiently familiar with traditional grammar to 
understand that these two words are a preposition and a noun respectively. 
Finally, the last word of the verse is tagged as an adjective. 

5.4.3 Part-of-Speech Tags for Classical Arabic 

The complete part-of-speech tagset adapted from Salih’s analysis contains 44 tags 
(Table 5.1, overleaf). In the table, tags have been organized into a hierarchy with 
three levels. The first level (column one) consists of the three main parts-of-speech 
from traditional grammar: the nominals (ism - اسم), verbs (fi’il - فعل) and 
particles (harf - حرف). The second level (column two) is an intermediate category. 
The third level in the tagset consists of the fine-grained parts-of-speech used to 
tag morphological segments (columns three to five). Only part-of-speech tags 
from this level are stored in the corpus database. The other two levels are abstract 
groups that are used to describe morphology and parts-of-speech in general terms. 

The last two columns in Table 5.1 provide descriptions using both English and 
Arabic terminology. For Arabic, Salih’s most commonly used term is listed for 
each part-of-speech. For English, equivalent terminology for nominal tags was 
derived by comparing three Classical Arabic reference grammars and selecting the 
most suitable translation based on Salih’s usage of each term (Wright, 2007; 
Haywood and Nahmad, 1990; Fischer and Rodgers, 2002). For particles, 
terminology from Gully (1995) was adapted by comparing to the dictionary of 
Quranic usage by Badawi and Haleem (2008). 

Figure 5.4 (page 89) shows example morphological segmentation and part-of- 
speech tagging for verses (1:1-7) of the Quran. The next three sections describe 
the part-of-speech tagset for Classical Arabic in more detail. 

Class       Subclass           Tag    Description 
Nominals    Nouns              N      Noun 
Nominals    Nouns              PN     Proper noun 
Nominals    Derived nominals   ADJ    Adjective 
Nominals    Derived nominals   IMPN   Imperative verbal noun 
Nominals    Pronouns           PRON   Personal pronoun 
Nominals    Pronouns           DEM    Demonstrative pronoun 
Nominals    Pronouns           REL    Relative pronoun 
Nominals    Adverbs            T      Time adverb 
Nominals    Adverbs            LOC    Location adverb 
Verbs       Verbs              V      Verb 
Particles   Prepositions       P      Preposition 
Particles   lam prefixes       EMPH   Emphatic lam prefix 
Particles   lam prefixes       IMPV   Imperative lam prefix 
Particles   lam prefixes       PRP    Purpose lam prefix 
Particles   Conjunctions       CONJ   Coordinating conjunction 
Particles   Conjunctions       SUB    Subordinating conjunction 
Particles   Other particles    ACC    Accusative particle 
Particles   Other particles    AMD    Amendment particle 
Particles   Other particles    ANS    Answer particle 
Particles   Other particles    AVR    Aversion particle 
Particles   Other particles    CAUS   Particle of cause 
Particles   Other particles    CERT   Particle of certainty 
Particles   Other particles    CIRC   Circumstantial particle 
Particles   Other particles    COM    Comitative particle 
Particles   Other particles    COND   Conditional particle 
Particles   Other particles    EQ     Equalization particle 
Particles   Other particles    EXH    Exhortation particle 
Particles   Other particles    EXL    Explanation particle 
Particles   Other particles    EXP    Exceptive particle 
Particles   Other particles    FUT    Future particle 
Particles   Other particles    INC    Inceptive particle 
Particles   Other particles    INT    Particle of interpretation 
Particles   Other particles    INTG   Interrogative particle 
Particles   Other particles    NEG    Negative particle 
Particles   Other particles    PREV   Preventive particle 
Particles   Other particles    PRO    Prohibition particle 
Particles   Other particles    REM    Resumption particle 
Particles   Other particles    RES    Restriction particle 
Particles   Other particles    RET    Retraction particle 
Particles   Other particles    RSLT   Result particle 
Particles   Other particles    SUP    Supplemental particle 
Particles   Other particles    SUR    Surprise particle 
Particles   Other particles    VOC    Vocative particle 
Particles   Quranic initials   INL    Disconnected letters 

Table 5.1: Part-of-speech tags for Classical Arabic. 
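
The tagset and its top-level grouping can be held in a small lookup structure. The sketch below encodes the 44 tags of Table 5.1 as a Java enum together with their main class; the type names are hypothetical, the intermediate subclass level is omitted for brevity, and placing INL under particles is an assumption based on the three-way top-level division.

```java
import java.util.ArrayList;
import java.util.List;

public class PosTagsetSketch {

    enum MainClass { NOMINAL, VERB, PARTICLE }

    enum Tag {
        // Nominals
        N(MainClass.NOMINAL), PN(MainClass.NOMINAL), ADJ(MainClass.NOMINAL),
        IMPN(MainClass.NOMINAL), PRON(MainClass.NOMINAL), DEM(MainClass.NOMINAL),
        REL(MainClass.NOMINAL), T(MainClass.NOMINAL), LOC(MainClass.NOMINAL),
        // Verbs
        V(MainClass.VERB),
        // Particles (the default main class of the no-argument constructor)
        P, EMPH, IMPV, PRP, CONJ, SUB,
        ACC, AMD, ANS, AVR, CAUS, CERT, CIRC, COM, COND, EQ, EXH, EXL, EXP,
        FUT, INC, INT, INTG, NEG, PREV, PRO, REM, RES, RET, RSLT, SUP, SUR, VOC,
        INL;

        final MainClass mainClass;

        Tag() {
            this(MainClass.PARTICLE);
        }

        Tag(MainClass mainClass) {
            this.mainClass = mainClass;
        }
    }

    // Lists the fine-grained tags that belong to one main class.
    static List<Tag> tagsIn(MainClass mainClass) {
        List<Tag> result = new ArrayList<>();
        for (Tag tag : Tag.values()) {
            if (tag.mainClass == mainClass) {
                result.add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Total tags: " + Tag.values().length);
        System.out.println("Nominal tags: " + tagsIn(MainClass.NOMINAL));
        System.out.println("INTG is a " + Tag.valueOf("INTG").mainClass);
    }
}
```

Only the fine-grained tag would be stored with a segment; the main class is recovered by lookup, matching the description of the first two levels as abstract groupings.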

(1:1) bi/P s’mi/N allahi/PN al/DET rahmani/ADJ al/DET rahimi/ADJ 

(1:2) al/DET hamdu/N li/P llahi/PN rabbi/N al/DET ‘alamina/N 

(1:3) al/DET rahmani/ADJ al/DET rahimi/ADJ 

(1:4) maliki/N yawmi/N al/DET dini/N 

(1:5) iyyaka/PRON na’budu/V wa/CONJ iyyaka/PRON nasta’inu/V 

(1:6) ih’di/V na/PRON al/DET sirata/N al/DET mus’taqima/ADJ 

(1:7) sirata/N alladhina/REL an’am/V ta/PRON ‘alay/P him/PRON ghayri/N 
al/DET maghdubi/N ‘alay/P him/PRON wa/CONJ la/NEG al/DET dallina/N 

Figure 5.4: Uthmani script and part-of-speech tagging for verses (1:1-7). 



5.5 Nominals 

The term ism (اسم) in Arabic linguistics is an autohyponym, used by traditional 
grammarians to refer to one of the three main parts-of-speech, as well as one of its 
subclasses. These two cases are distinguished in Arabic computational work by 
using the term ‘nominal’ for the general class, and the term ‘noun’ for the specific 
subclass (Diab, 2007; Smrz, 2007; Habash and Roth, 2009c). In the Quranic 
corpus, nine tags are used for nominals: POS:N and POS:PN for nouns and proper 
nouns, POS:PRON, POS:DEM and POS:REL for personal, demonstrative and 
relative pronouns, POS:ADJ for adjectives, POS:LOC and POS:T for adverbs of 
place and time, and POS:IMPN for the imperative verbal noun. 

5.5.1 Nouns 

In Arabic grammar, words are classified as nouns (POS:N) primarily according to 
syntactic criteria (Owens, 1989). For example, Al-Zajjaji (892-951) defined a 
noun as a word occurring as the subject or object of a verb. Ibn Jinni (932-1002) 
included the more specific criteria that nouns are words placed into the genitive 
case by prepositions (حرف جر). Remarkably similar criteria are used in modern 
linguistics to define nouns. For example, Loos et al. (2004) propose the universal 
definition that nouns are words acting as the subjects or objects of verbs, or as the 
objects of prepositions or postpositions. 

5.5.2 Proper Nouns 

Classical Arabic script makes no orthographic distinction between nouns and 
proper nouns (اسم علم), unlike English where capitalization is used. However, most 
proper nouns (tagged as POS:PN) have the grammatical property that they are 
definite without having to carry the al- determiner prefix. Many proper nouns in 
the Quran are of a foreign or ancient origin. Morphologically, these fall outside 
the root and pattern system and are subject to restricted inflection rules. For 
example, the name Aaron (harun - هارون) is a diptote (ممنوع من الصرف) and has the 
same inflected case-ending for both the genitive and accusative case. Although 
Salih flags diptotes, he does not generally indicate which nominals are proper 
nouns. A prominent exception to this is the name Allah (الله), which is referred to 
as lafth al-jalalah (لفظ الجلالة), literally ‘the majestic name’. 

5.5.3 Personal Pronouns 

In traditional grammar, personal pronouns (POS:PRON) are classified into two 
types. Suffixed pronouns are known as damir muttasil (ضمير متصل). These require 
segmentation for annotation, described further in section 5.9. The second type are 
separate words known as damir munfasil (ضمير منفصل), forming a small closed 
class of inflected forms (Table 5.2). In Arabic, personal pronouns include forms 
not found in English, such as the second person dual antuma (‘you two’). To 
simplify the segmentation process, members of the lexeme iyya (إيا), such as the 
third person masculine singular form iyyahu (إياه), are also tagged as POS:PRON 
and annotated as a single word. These are known traditionally as damir nasb 
munfasil (ضمير نصب منفصل), and are syntactically used as objects. 



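Person              Singular       Dual              Plural 
First               ana (أنا)      (none)            nahnu (نحن) 
Second Masculine    anta (أنتَ)     antuma (أنتما)    antum (أنتم) 
Second Feminine     anti (أنتِ)     antuma (أنتما)    antunna (أنتن) 
Third Masculine     huwa (هو)      huma (هما)        hum (هم) 
Third Feminine      hiya (هي)      huma (هما)        hunna (هن) 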

Table 5.2: Independent personal pronouns. 



5.5.4 Demonstrative Pronouns 

Demonstrative pronouns are known as ism ishara (اسم اشارة) and are tagged as 
POS:DEM. Traditional grammarians distinguish between demonstratives used for 
objects that are near (ism ishara lilqarib - اسم اشارة للقريب) and far (ism ishara 
lilba’id - اسم اشارة للبعيد). The same distinction is found in other languages such as 
English. The main inflection forms are shown in Table 5.3 (overleaf). 
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Type 


Number 


Gloss 


Gender 


Form 


Near 


Singular 


this 


Masculine 


Ijjti 


hadha 


Feminine 


ala 


hadhihi 


Dual 


these (two) 


Masculine 


j]la 


hadhani 


Feminine 


jla 


ha tan i 


Plural 


these (all) 


All 


djA 


ha ’ula 7 


Far 


Singular 


that 


Masculine 


di 


dhalika 


Feminine 


db 


tilka 


Dual 


those (two) 


Masculine 


db 


dhanika 


Feminine 1 


db 


tanika 


Plural 


those (all) 


All 


di/i 


ula ’ika 



1 This inflected form is not used in the Quran. 



Table 5.3: Main inflection forms for demonstrative pronouns. 

5.5.5 Relative Pronouns 

Relative pronouns (POS:REL) are known as ism mawsul (اسم موصول) in Arabic.
Syntactically, these connect a relative clause to its main clause. Certain words
such as inflected forms of alladhi (الذي) are easily tagged as relative pronouns as
this is their main part-of-speech. Other relative pronouns include man (من) and ma
(ما). However, because these two words frequently occur in more than one
grammatical category, syntactic context is required to choose the correct
part-of-speech tag. For example, the word ma (‘what’) is tagged as POS:REL in verse
(109:2): la a’budu ma ta’buduna (لا أعبد ما تعبدون) - ‘I do not worship what you
worship.’ In contrast, ma (‘what’) is tagged as an interrogative (POS:INTG) in
verse (99:3): waqala al-insanu ma laha (وقال الإنسان ما لها) - ‘And man says, “What
is [wrong] with it?”’.

5.5.6 Adjectives

Adjectives (sifa - صفة) are tagged as POS:ADJ and are closely related to nouns
(POS:N). Without context, it can be difficult to distinguish the two as both occur
with similar morphological features. For example, both can carry the prefix al-
(‘the’). For this reason, adjectives are tagged according to syntactic criteria. In
Classical Arabic, an adjective appears after the noun it describes, and is subject to
a set of grammatical agreement rules. An example is the two-word verse (101:11)
which consists of a noun followed by an adjective. Both words are indefinite and
in the nominative case: narun hamiyatun (نار حامية) - ‘a blazing fire’.

5.5.7 Adverbs 

The term ‘adverb’ is used to describe a variety of grammatical categories in
part-of-speech tagsets for English, with different classifications used for different
tagged corpora (Atwell, 2008; Nancarrow, 2011). For part-of-speech tagging in
the Quranic Arabic Corpus, the term is specifically used for the adverbs of place
(POS:LOC) - dharf makan (ظرف مكان) and the adverbs of time (POS:T) - dharf
zaman (ظرف زمان). These usually appear in adverbial expressions in the accusative
case. For example, wara’a (‘behind’) is tagged as POS:LOC in verse (84:10):
wa-amma man utiya kitabahu wara’a dhahrihi (وأما من أوتي كتابه وراء ظهره) - ‘But as
for he who is given his record behind his back’. Similarly, ahqaban (‘ages’) appears
in the accusative case and is tagged as POS:T in verse (78:23): labithina fiha
ahqaban (لابثين فيها أحقابا) - ‘In which they will remain for ages’.

5.5.8 Imperative Verbal Nouns 

Salih uses the grammatical term ism fi’il ‘amr (اسم فعل أمر) in only a few places in
the Quran. In the Quranic Arabic Corpus, these words are tagged as imperative
verbal nouns (POS:IMPN). For example, this tag is used for the word misasa
(مساس) in verse (20:97). In this context, the word appears as a nominal, yet has an
imperative meaning: la misasa (لا مساس) - ‘do not touch’.

5.6 Verbs

Verbs are one of the three main parts-of-speech in traditional Arabic grammar,
and are known as fi’il (فعل). Historically, grammarians classified words as verbs
primarily using semantic and morphological criteria. For example, Al-Zajjaji
defined a verb semantically as a word that represents past, present and future
actions. Ibn Hisham defined a verb morphologically as a word derived from a root
using a well-known verbal pattern (Owens, 1989). In the Quranic Arabic Corpus,
verbs are annotated using the POS:V tag. Morphological features are used to
subclassify verbs according to their template pattern, inflection attributes and
syntactic group. For example, verbs in the group known as kana wa akhwatuha
(كان وأخواتها) are tagged as POS:V together with a feature marker. In contrast,
nominals derived from verbs, such as participles, are tagged as either POS:N or
POS:ADJ according to their syntactic usage.

5.7 Particles 

In traditional Arabic grammar, a word is classified as a particle, harf (حرف), if it
is neither a nominal (اسم) nor a verb (فعل). In contrast to previous tagged Arabic
corpora, the Quranic Arabic Corpus provides deep annotation of particles using 34
tags. In the tagset hierarchy, particles are subclassified into Quranic initials
(POS:INL), prepositions (POS:P), conjunctions (POS:CONJ and POS:SUB),
prefixed lam particles (three additional tags), and other particles (27 tags).
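
The arithmetic behind the figure of 34 particle tags can be checked with a short
sketch. The per-subclass counts are read off the paragraph above (one tag each for
Quranic initials and prepositions, two for conjunctions, three for prefixed lam
particles and 27 for other particles). The dictionary name and its keys are
illustrative only and are not part of the corpus.

```python
# Particle subclasses and their tag counts, as stated in section 5.7.
# The names used here are hypothetical labels chosen for illustration.
PARTICLE_TAG_COUNTS = {
    "Quranic initials (POS:INL)": 1,
    "Prepositions (POS:P)": 1,
    "Conjunctions (POS:CONJ, POS:SUB)": 2,
    "Prefixed lam particles": 3,
    "Other particles": 27,
}

# Summing the subclasses should reproduce the total quoted in the text.
TOTAL_PARTICLE_TAGS = sum(PARTICLE_TAG_COUNTS.values())
```

Under these counts the five subclasses account for all 34 particle tags.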

5.7.1 Quranic Initials 

Quranic initials, huruf muqatta’ah (حروف مقطعة), are sequences of disconnected
letters, such as alif lam mim (الم), that appear at the start of several chapters in
the Quran. Their interpretation has no firm consensus in Quranic exegesis, and in
Islam their meaning is generally considered to be a divine secret (Shahid, 2000).
As their grammatical function is not specified, they are tagged as a separate
part-of-speech (POS:INL).

5.7.2 Prepositions

Prepositions (POS:P) are known as harf jar (حرف جر). They precede nominals,
placing them into the genitive case. Independent prepositions include ‘ala (على)
and fi (في), usually translated as ‘on’ and ‘in’ respectively. POS:P is also used to
tag vowelized prepositional prefixes, including ba’ (ب), kaf (ك), ta’ (ت), waw (و),
and one of the senses of lam (ل). In contrast to Modern Arabic, which has a
reduced set of prefixes, ta’ and waw occur in Classical Arabic as particles of oath.
An example is tallahi (‘by Allah’) in verse (37:56): qala tallahi in kidta laturdini
(قال تالله إن كدت لتردين) - ‘He will say, “By Allah, you almost ruined me.”’.

5.7.3 Prefixed lam Particles 

The prefix lam (ل) has four uses including its use as a preposition. POS:EMPH is
used for the emphatic prefix (لام التوكيد), such as (4:66:23) lakana (لكان) - ‘surely it
would have been’. POS:IMPV is used for the imperative prefix (لام الأمر) which
precedes imperfect verbs placing them into the jussive mood, such as (106:3:1):
falya’budu (فليعبدوا) - ‘so let them worship’. The prefix lam also occurs as a
particle of purpose (لام التعليل) tagged as POS:PRP. In this construction, the particle
introduces a subordinate clause and places the following verb into the subjunctive
mood, such as (72:17:1) linaftinahum (لنفتنهم) - ‘that we might test them’.



5.7.4 Coordinating and Subordinating Conjunctions 

In traditional grammar, coordinating conjunctions (حرف عطف) are particles that
connect two words or phrases, and are tagged as POS:CONJ. The prefixed particle
waw (و) used in its conjunctive sense (‘and’) is the most common coordinating
conjunction. Independent coordinating conjunctions include thumma (ثم) ‘then’, as
well as aw (أو) and am (أم), usually translated as ‘or’. Subordinating conjunctions
are tagged as POS:SUB. In Classical Arabic, the most common subordinating
conjunction (حرف مصدري) is one sense of the particle an (أن), usually translated as
‘that’. Syntactically, particles tagged as POS:SUB introduce subordinate clauses.

5.7.5 Other Particles

In addition to the part-of-speech tags described in the preceding sections, a further
27 tags are used for other particles (the fourth subclass in Table 5.1, page 88).
Some of these particles appear only in Classical and not Modern Arabic, such as
the prefixed hamza of equalization (همزة التسوية), tagged as POS:EQ. Historically,
grammarians such as Ibn Hisham provided detailed analysis of Arabic particles
(Gully, 1995). Based on traditional sources, the Quranic Arabic Corpus tagset is
used to classify particles according to both syntactic and semantic criteria.

Syntactically, traditional Arabic grammar describes the rules that determine the
way in which particles modify the inflection of surrounding words. An example is
the vocative particles (حرف نداء), tagged as POS:VOC. These precede nouns and
place them into the nominative or accusative case according to syntactic context
and the nature of the individuals being addressed. Similarly, exceptive particles
(أداة استثناء) tagged as POS:EXP place nouns into the accusative case depending on
contextual negation and ellipsis (Ansari, 2000; Jones, 2005). Another example of
the syntactic classification of particles is the frequently occurring accusative
particles (harf nasb - حرف نصب), tagged as POS:ACC. In traditional Arabic grammar, a
group of accusative particles known as inna wa akhwatuha (إن وأخواتها) are
considered to be verb-like (حرف مشبه بالفعل), as they appear in syntactic
constructions similar to verbs. Like the verb kana (كان), these particles take a
subject and a predicate. However, they differ from verbs syntactically by placing
their subjects (اسم إن) into the accusative case, and their predicates (خبر إن) into
the nominative case.

Other particles are classified on semantic grounds. These include the negative
particles (حرف نفي) tagged as POS:NEG, prohibition particles (حرف نهي) tagged as
POS:PRO and interrogative particles (حرف استفهام) tagged as POS:INTG. The tag
POS:SUP is used for supplemental particles (حرف زائد) which occur infrequently
in the Quran. Grammarians consider these particles to supplement an existing
sentence. Although they do not generally add extra meaning, they often make a
sentence sound better when recited aloud, improving a verse’s prosodic balance
(Wohaibi, 2001).

5.8 Morphological Features

In addition to part-of-speech tagging, morphological segments are annotated with
multiple feature-value pairs encoded as a sequence of feature tags. Table 5.4
summarizes the feature tags used in the corpus.

5.8.1 Prefixes 

During morphological segmentation, word-forms are segmented into prefixes,
stems and suffixes. Prefix features are annotated using the notation X:C+ where X
is the prefixed particle and C is its part-of-speech tag. For example, f:CONJ+ is
used for words with the particle fa’ (ف) prefixed as a coordinating conjunction
(الفاء عاطفة). The notation X+ is used for prefixes that belong to only a single
part-of-speech, such as the prefix feature Al+ for the determiner al (ال).
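
The two prefix notations can be read mechanically. The following is a minimal
sketch, assuming prefix features are available as plain strings such as f:CONJ+
or Al+. The names PrefixFeature and parse_prefix_feature are hypothetical and do
not correspond to any published corpus API.

```python
import re
from typing import NamedTuple, Optional


class PrefixFeature(NamedTuple):
    particle: str        # X in the notation, e.g. "f" or "Al"
    pos: Optional[str]   # C in X:C+, or None for the bare X+ form


# X is any run of characters other than ':', '+' and whitespace;
# the optional ':C' part carries an upper-case part-of-speech tag.
_PREFIX_RE = re.compile(r"^(?P<particle>[^:+\s]+)(?::(?P<pos>[A-Z]+))?\+$")


def parse_prefix_feature(tag: str) -> PrefixFeature:
    """Parse a prefix feature written as X:C+ or X+."""
    match = _PREFIX_RE.match(tag)
    if match is None:
        raise ValueError(f"not a prefix feature: {tag!r}")
    return PrefixFeature(match.group("particle"), match.group("pos"))
```

With this sketch, f:CONJ+ yields the particle f with part-of-speech CONJ, while
Al+ yields the particle Al with no separate part-of-speech. Strings that do not
end in a plus sign, such as suffix features, are rejected.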

5.8.2 Suffixes 

Two suffix features are annotated using the notation +X. The first is the vocative
suffix +VOC. This is only used with the word allah to produce the vocative
word-form allahumma (اللهم) that occurs several times in the Quran. The second
suffix tag is +n:EMPH, used to denote an emphatic suffixed letter nun (نون التوكيد).
The compound PRON: tag is used for suffixed pronouns (ضمير متصل) in combination
with person, gender and number features. For example, PRON:3MS represents a
suffixed pronoun inflected for the third person masculine singular.

5.8.3 Classification Features 

In addition to the part-of-speech tag (formally considered a feature), a further three
features are used to classify words. ROOT: and LEM: indicate roots and lemmas,
specified using Buckwalter transliteration. For example, LEM:kitaAb is used for the
lemma kitab (كتاب). The SP: feature is used to group words with a special syntactic
function in traditional grammar. It is used for kana wa akhwatuha (كان وأخواتها),
kada wa akhwatuha (كاد وأخواتها) and inna wa akhwatuha (إن وأخواتها).
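
To illustrate how a Buckwalter-encoded lemma maps back to Arabic script, the
sketch below decodes the LEM:kitaAb example. The mapping is a deliberately partial
subset of the standard Buckwalter table, limited to a few letters and short-vowel
marks. The function names are hypothetical, and the corpus itself may use an
extended variant of the scheme.

```python
# Partial Buckwalter-to-Unicode table (assumption: standard Buckwalter values;
# only the symbols needed for simple examples are included).
BUCKWALTER_TO_ARABIC = {
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "t": "\u062A",  # ta
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "h": "\u0647",  # ha
    "w": "\u0648",  # waw
    "y": "\u064A",  # ya
    "a": "\u064E",  # fatha
    "u": "\u064F",  # damma
    "i": "\u0650",  # kasra
    "~": "\u0651",  # shadda
    "o": "\u0652",  # sukun
}


def buckwalter_to_arabic(text: str) -> str:
    """Decode a Buckwalter string one symbol at a time."""
    try:
        return "".join(BUCKWALTER_TO_ARABIC[ch] for ch in text)
    except KeyError as exc:
        raise ValueError(f"symbol {exc.args[0]!r} is not in this partial table") from exc


def parse_lemma_feature(feature: str) -> str:
    """Return the Arabic-script lemma for a feature such as 'LEM:kitaAb'."""
    prefix = "LEM:"
    if not feature.startswith(prefix):
        raise ValueError(f"not a lemma feature: {feature!r}")
    return buckwalter_to_arabic(feature[len(prefix):])
```

Because Buckwalter transliteration is a one-to-one character mapping, decoding
needs no context. Each symbol is simply replaced by its Arabic letter or
diacritic.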
Type       Category             Tag        Description

Prefixes   Letter alif as a     A:INTG+    Interrogative alif (همزة استفهام)
           prefixed particle    A:EQ+      Equalization alif (همزة التسوية)

           Letter waw as a      w:CONJ+    Conjunction waw (الواو عاطفة)
           prefixed particle    w:REM+     Resumption waw (الواو استئنافية)
                                w:CIRC+    Circumstantial waw (الواو حالية)
                                w:SUP+     Supplemental waw (الواو زائدة)
                                w:P+       Preposition waw (الواو حرف جر)
                                w:COM+     Comitative waw (واو المعية)

           Letter fa’ as a      f:CONJ+    Conjunction fa’ (الفاء عاطفة)
           prefixed particle    f:REM+     Resumption fa’ (الفاء استئنافية)
                                f:SUP+     Supplemental fa’ (الفاء زائدة)
                                f:RSLT+    Result fa’ (الفاء واقعة في جواب الشرط)
                                f:CAUS+    Cause fa’ (الفاء سببية)

           Letter lam as a      l:P+       Preposition lam (اللام حرف جر)
           prefixed particle    l:EMPH+    Emphasis lam (لام التوكيد)
                                l:PRP+     Purpose lam (لام التعليل)
                                l:IMPV+    Imperative lam (لام الأمر)

           Other prefixes       Al+        Determiner al (لام التعريف)
                                bi+        Preposition ba’ (حرف جر)
                                ka+        Preposition kaf (حرف جر)
                                ta+        Preposition ta’ (حرف جر)
                                sa+        Future particle sin (حرف استقبال)
                                ya+        Vocative particle ya’ (حرف نداء)
                                ha+        Vocative particle ha’ (حرف نداء)

Core       Classification       POS        Part-of-speech
Features   features             LEM:       Lemma
                                ROOT:      Root (جذر)
                                SP:        Special group (e.g. كان وأخواتها)

           Verbal features      Form       I to XII (وزن)
                                Aspect     Perfect, imperfect or imperative
                                Mood       Indicative, subjunctive or jussive
                                Voice      Active (مبني للمعلوم) or passive (مبني للمجهول)

           Nominal features     Derivation Participle or verbal noun
                                State      Definite (معرفة) or indefinite (نكرة)
                                Case       Nominative, accusative or genitive

           Phi features         Person     First, second or third (الإسناد)
                                Gender     Masculine or feminine (الجنس)
                                Number     Singular, dual or plural (العدد)

Suffixes   Suffix features      +VOC       Vocative suffix (used for اللهم)
                                +n:EMPH    Emphasis nun (نون التوكيد)
                                PRON:      Pronoun suffix (ضمير متصل)

Table 5.4: Morphological feature tags for Classical Arabic.

5.8.4 Phi Features

The phi-features for Classical Arabic are person, gender and number, and are
annotated using a compound tag. For example, 3MS represents third person
masculine singular. The values for the person feature are first person (المتكلم),
second person (المخاطب) and third person (الغائب). Gender (الجنس) is a complex topic
in Arabic and words may have different values for semantic, morphemic and
grammatical gender. In the corpus, grammatical gender is tagged, as this is the
most useful type of gender for syntactic annotation.
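
A compound phi-feature tag such as 3MS can be decomposed positionally. The sketch
below assumes single-character codes in the order person, gender, number. Only
3MS is attested in the text above, so the remaining codes (1, 2, F, D, P) and the
possibility of omitting a component are assumptions made for illustration. The
names Phi and parse_phi are hypothetical.

```python
from typing import NamedTuple, Optional

# Assumed single-character codes; only "3", "M" and "S" appear in the text.
PERSON = {"1": "first", "2": "second", "3": "third"}
GENDER = {"M": "masculine", "F": "feminine"}
NUMBER = {"S": "singular", "D": "dual", "P": "plural"}


class Phi(NamedTuple):
    person: Optional[str]
    gender: Optional[str]
    number: Optional[str]


def parse_phi(tag: str) -> Phi:
    """Split a compound tag such as '3MS' into person, gender and number."""
    rest = tag
    values = []
    for table in (PERSON, GENDER, NUMBER):
        code = rest[:1]
        if code in table:
            values.append(table[code])
            rest = rest[1:]
        else:
            values.append(None)  # component not present in this tag
    if rest or not tag:
        raise ValueError(f"unrecognized phi-feature tag: {tag!r}")
    return Phi(*values)
```

The same function could be applied to the part of a PRON: suffix tag that follows
the colon, since PRON:3MS carries the same compound value.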

5.8.5 Verbal Features 

The features aspect, mood, voice and form apply to verbs and their derivatives:
active and passive participles and verbal nouns. In Arabic grammar, aspect is
closely related to but distinct from tense. The aspect tags are PERF for perfect
(فعل ماض), IMPF for imperfect (فعل مضارع) and IMPV for imperative (فعل أمر). The
mood tags are IND for indicative (مرفوع), SUBJ for subjunctive (منصوب) and JUS
for jussive (مجزوم). Voice is tagged as either ACT for active (مبني للمعلوم) or PASS
for passive (مبني للمجهول). Verb forms are tagged using roman numerals (I to XII), a
convention introduced in Western works describing traditional Arabic grammar
(Haywood and Nahmad, 1990; Wright, 2007).

5.8.6 Nominal Features 

In Arabic, nominals may be in a definite (معرفة) or indefinite (نكرة) state. These are
tagged using the features DEF and INDEF respectively. Nominals derived from
verbs are tagged using a derivation feature. The possible values are ACT PCPL
for the active participle (اسم فاعل), PASS PCPL for the passive participle (اسم مفعول)
and VN for verbal nouns (مصدر). In various linguistic constructions, nominals
with these derivation tags function similarly to verbs. Syntactically, nominals are
also found in one of three cases: NOM for the nominative case (مرفوع), ACC for
the accusative case (منصوب) and GEN for the genitive case (مجرور).

5.9 Segmentation Rules

A segmenter is a computational component that divides words into segments. The
segmenter developed for the Quranic Arabic Corpus splits words using annotated 
morphological features. For example, a word tagged as w:CONJ+ POS:N will be 
divided into the prefixed letter waw followed by the remaining letters as a stem. 
Segmentation for the Quran is challenging due to the Uthmani script’s complex 
orthography with multiple possible forms for prefixes and suffixes as well as the 
presence of zero-length morphological segments. Table 5.5 below summarizes the 
morphological segmentation rules used in the corpus: 



Type       Feature         Segmentation              Example

Prefixes   w:CONJ+, ...    Single letter particles   (5:15:22)
           alif prefixes   Single letter alif        (21:36:9)
                           Single letter hamza       (56:59:1)
           ya+ or ha+      Single letter vocative    (20:94:2)
                           Two letter vocative       (20:36:5)
           Al+             Two letter determiner     (2:2)
                           Single letter after lam   (16:69:18)
                           Elided letter alif        (26:176:3)

Stems      POS:            Single stem               (67:1:3)
                           Two stems                 (15:32:5)

Suffixes   +VOC            Single letter suffix      (10:10:4)
           +n:EMPH         Emphatic letter nun       (3:188:2)
                           Emphatic letter alif      (12:32:17)
           Verb subjects   Subject pronoun           (1:7:3)
                           Subject with object       (18:76:3)
           PRON:           Elided (zero-length)      (3:35:5)
                           Single object             (38:20:2)
                           Two objects               (8:43:2)
                           Two objects and subject   (33:37:31)

Table 5.5: Morphological segmentation rules for Classical Arabic.
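
The simplest rule in Table 5.5, the single letter prefixed particle, can be sketched
as follows. This is not the segmenter developed for the corpus. It is a minimal
illustration that assumes a vocalized word given as a Unicode string, with features
listed in surface order, and it handles only X:C+ prefixes followed by a single
POS: stem. The function names are hypothetical, and the other rules in the table
(two letter prefixes, elided letters, suffixes and zero-length segments) are
deliberately left out.

```python
import unicodedata
from typing import List, Tuple


def split_first_letter(word: str) -> Tuple[str, str]:
    """Split off the first letter together with the diacritics attached to it."""
    if not word:
        raise ValueError("no letters left to segment")
    end = 1
    # Arabic diacritics are combining marks, so they stay with their letter.
    while end < len(word) and unicodedata.combining(word[end]):
        end += 1
    return word[:end], word[end:]


def segment(word: str, features: List[str]) -> List[Tuple[str, str]]:
    """Return (segment, part-of-speech) pairs for single letter prefixes and a stem."""
    segments = []
    rest = word
    for feature in features:
        if feature.startswith("POS:"):
            # The stem takes all remaining letters.
            segments.append((rest, feature[len("POS:"):]))
            rest = ""
        elif feature.endswith("+") and feature.count(":") == 1:
            # X:C+ prefix: one letter, tagged with part-of-speech C.
            _, pos = feature[:-1].split(":")
            prefix, rest = split_first_letter(rest)
            segments.append((prefix, pos))
        else:
            raise ValueError(f"feature not handled by this sketch: {feature!r}")
    return segments


# Illustrative word: waw + fatha followed by the stem 'kitAbN' (Buckwalter).
EXAMPLE_WORD = "\u0648\u064E\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
EXAMPLE_SEGMENTS = segment(EXAMPLE_WORD, ["w:CONJ+", "POS:N"])
```

For a word tagged as w:CONJ+ POS:N, the sketch yields the prefixed letter waw with
its vowel as a CONJ segment, followed by the remaining letters as an N stem, which
mirrors the example given at the start of this section.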